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Abstract.  The  CRCW  PRAM  under  dynamic  fail-stop  (no  restart)  processor  behavior  is 
a  fault-prone  multiprocessor  mode!  for  which  it  is  possible  to  both  guarantee  reliability  and 
preserve  efficiency.  To  handle  dynamic  faults  some  redundancy  is  necessary  in  the  form  of 
many  processors  concurrently  performing  a  common  read  or  write  task.  In  this  paper  we  show 
how  to  significantly  decrease  this  concurrency  by  bounding  it  in  terms  of  the  number  of  actual 
processor  faults.  We  describe  a  low  concurrency,  efficient  and  fault-tolerant  algorithm  for  the 
Write- All  primitive:  “using  <  N  processors,  write  I’s  into  N  locations”.  This  primitive 
can  serve  as  the  basis  for  efficient  fault-tolerant  simulations  of  algorithms  written  for  fault- 
free  PRAMs  on  fault-prone  PRAMs.  For  any  dynamic  failure  pattern  F,  our  algorithm  has 
total  write  concurrency  <  |F|  and  total  read  concurrency  <  7  |F[iogA,  where  |F|  is  the 
number  of  processor  faults  (for  example,  there  is  no  concurrency  in  a  run  without  failures); 
note  that,  previous  algorithms  used  ft  (A'  log  Ar)  concurrency  even  in  the  absence  of  faults. 
We  also  describe  a  technique  for  limiting  the  per  step  concurrency  and  present  an  optimal 
fault -tolerant  EREW  PRAM  algorithm  for  Write-All,  when  all  processor  faults  are  initial. 
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1.  Introduction 

1.1.  Motivation  and  Background 

A  basic  problem  of  massively  parallel  computing  is  that  the  unreliability  of  inexpensive 
processors  and  their  interconnection  may  eliminate  any  potential  efficiency  advantage  of 
parallelism.  Our  research  is  an  investigation  of  fault  models  and  parallel  computation 
models  under  which  it  is  possible  to  achieve  algorithmic  efficiency  (i.e.,  speed-ups  close  to 
linear  in  the  number  of  processors)  despite  the  presence  of  faults. 

Such  combinations  of  fault  and  computation  models  are  interesting  because  they  il¬ 
lustrate  good  trade-offs  between  reliability  and  efficiency.  To  see  that  there  is  a  trade-off 
consider  that  reliability  usually  requires  adding  redundancy  to  the  computation  in  order  to 
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detect  errors  and  reassign  resources,  whereas  gaining  efficiency  by  rhasslvely  parole’  com¬ 
puting  requires  removing  redundancy  from  the  computation  to  fully  utilize  each  processor. 

Even  allowing  for  some  abstraction  in  the  model  of  parallel  computation,  it  is  not  obvious 
that  there  are  any  non-trivia!  fault  models  that  allow  near-linear  speed-ups.  So  it  was 
somewhat  surprising  when  in  [18]  we  demonstrated  that  it  is  possible  to  combine  efficiency 
and  fault-tolerance  for  many  basic  algorithms  expressed  as  concurrei.*-read  concurrent-write 
parallel  random  access  machines  (CRCW  PRAMs).  The  [18]  fault  model  allows  any  pattern 
of  dynamic  fail-stop  no  restart  processor  errors,  as  long  as  one  processor  remains  alive. 
Interestingly,  our  technique  can  be  extended  to  all  CRCW  PRAMs.  In  [20]  it  is  shown  how 
to  simulate  any  fault-free  PRAM  on  a  CRCW  PRAM  that  executes  in  a  fail-stop  no- restart 
environment  with  comparable  efficiency  and  in  [31]  it  is  shown  how  the  simulation  can  be 
made  optimal  using  processor  slackness.  The  fault  model  was  extended  in  [32]  to  include 
arbitrary  static  memory  faults ,  i.e.,  arbitrary  memory  initiadization.  The  CRCW  PRAM 
computation  model  is  a  widely  used  abstraction  of  real  massively  parallel  machines.  As 
shown  in  [18],  it  suffices  to  consider  COMMON  CRCW  PRAMs  (all  concurrent  writes  are 
identical)  in  which  the  atomically  written  words  need  only  contain  a  constant  number  of 
bits. 

All  the  above-mentioned  algorithmic  work  makes  two  somewhat  unrealistic,  but  critical, 
assumptions.  It  is  assumed  that:  (1)  processors  can  read  and  write  memory  concurrently, 
and  (2)  processor  faults  are  fail-stop  without  restarts.  This  paper  is  an  investigation  of 
assumption  (1)  about  the  model  of  computation,  while  still  making  assumption  (2)  about 
the  fault  model.  Specifically,  the  redundancy  which  is  essential  for  the  above  algorithmic 
techniques  seems  to  be  the  result  of  more  than  one  processor  concurrently  performing  a 
common  read  or  write  task.  Our  contribution  here  is  to  show  how  to  control  and  minimize 
this  redundancy. 

Our  analysis  is  deterministic  with  worst-case  failure-inducing  adversaries.  Fot  this  we 
still  assume  (2)  above,  namely  that  the  dynamic  processor  errors  do  not  affect  shared 
memory  and  processors  once  stopped  do  not  resume  computation.  More  general  processor 
asynchrony  has  been  examined  in  [2,  3,  4,  7,  8,  9,  11,  14,  20,  21,  22,  25,  26,  27],  Many 
of  these  analyses  involve  average  processor  behavior  and  use  randomization.  Some  (e,g.. 
[2.  7.  22])  deal  with  deterministic  asynchronous  computation  (i.e.,  removing  assumption  (2) 
without  using  randomness).  In  this  case  the  strongest  lower  bounds  are  in  [22]  and  the 
tightest  upper  bounds  are  in  [2]. 

A  key  primitive  in  much  of  the  above  mentioned  work  is  the  Write- All  operation  of  [18]. 
The  main  technical  insight  in  this  paper  is  an  efficient,  deterministic  fault-tolerant  Write- All 
algorithm  that  significantly  reduces  memory  access  concurrency. 

The  Write-All  problem  is:  using  P  processors  write  Is  into  all  locations  of  an  array  of 
size  N .  where  P  <  N.  When  P  =  N  this  operation  captures  the  computational  progress 
that  can  be  naturally  accomplished  in  one  time  unit  by  a  PRAM.  In  the  fail-stop  PRAM 
model  the  P  processors  share  a  global  clock  and  are  fail-stop  no  restart. 

Under  dynamic  failures,  efficient  deterministic  solutions  to  Write-All,  i.e.,  increasing 
the  fault-free  O(N)  work  by  small  polylog(A’’)  factors,  are  non-obvious.  The  first  such 
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solution  was  algorithm  W  of  [18]  which  has  (to  date)  the  best  worst-case  work  bound 
0(N  +  P  log2  N l  log  log  N )  for  l  <  P  <  N  (this  bound  was  shown  in  [22]  for  a  variant 
of  algorithm  W  and  in  [24]  the  same  bound  was  shown  for  algorithm  W).  Algorithm  W 
was  extended  to  handle  arbitrary  initial  memory  in  [32]  and  served  as  a  building  block  for 
transforming  fault-free  PRAMs  into  fault-prone  PRAMs  of  comparable  efficiency  in  [20, 31]. 

We  say  that  Write- All  completes  at  the  global  clock  tick  at  which  all  the  processors 
that  have  not  fail-stopped  share  the  knowledge  that  1  Is  have  been  written  into  all  N  array 
locations.  Requiring  completion  of  a  Write- All  algorithm  is  critical  if  one  wishes  to  iterate 
it.  as  pointed  out  in  [20]  which  uses  a  certification  bit  to  separate  the  various  iterations 
of  (Certified)  Write-All.  Note  that  the  Write-All  completes  when  all  processors  halt  in 
algorithm  W.  This  is  also  the  case  for  the  algorithms  presented  here. 

In  Section  2.1  we  present  our  fault  and  computation  models.  In  Section  2.2  we  review 
our  measures  of  efficiency  and  introduce  a  measure  of  concurrency.  To  make  the  exposition 
seif  contained  we  present  in  Section  2.3  an  outline  of  the  main  algorithmic  ideas  of  algorithm 
W  and  sketch  how'  it  is  utilized  in  PRAM  simulation  and  memory  clearing. 

Memory  access  concurrency  is  the  main  topic  of  this  paper.  In  [18]  it  is  shown  that  read 
and  write  concurrency  are  necessary  for  deterministic  ei" cient  solutions  under  dynamic 
processor  faults.  All  Write- All  solutions  that  have  efficient  work  for  any  dynamic  processor 
failures  require  concurrent-read  concurrent- write  (CROW)  PRAMs.  Moreover,  all  such 
solutions  to  date  make  very  liberal  use  of  concurrent  reads  and  writes. 

1.2.  Summary  of  Results 

In  Section  3  we  present  our  main  contribution,  that  we  can  bound  the  total  amount  of 
memory  access  concurrency  in  terms  of  the  number  of  dynamic  processor  faults  of  the  actual 
run  of  the  algorithm. 

When  there  are  no  faults  our  new  algorithm  executes  as  an  EREW  PRAM  and  when 
there  are  faults  the  algorithm  differs  from  an  EREW  PRAM  in  an  amount  of  concurrency 
related  to  the  number  of  faults.  The  algorithm  is  based  on  a  conservative  policy  towards 
concurrency:  concurrent  reads  or  writes  occur  only  when  the  presence  of  failures  can  be 
inferred  from  what  processors  know  (i.e.,  when  failures  are  detected)  and  these  concurrent 
operations  happen  in  proportion  to  the  failures  detected. 

We  describe  an  efficient  fault-tolerant  CRCW  algorithm  for  the  Write- All  problem  such 
that,  for  any  failure  pattern  F  in  a  run  of  the  algorithm  the  total  write  concurrency  is 
<  \F\  and  the  total  read  concurrency  is  <  7  [FjloglV,  where  [Fj  is  the  number  of  failures. 
The  algorithm  is  based  on  algorithm  W  and  makes  use  of  processor  identifiers  to  construct 
mergeable  processor  priority  trees,  which  control  concurrent  access  to  memory.  We  present 
it  in  two  steps,  first  controlled  write  concurrency  (Section  3.1)  and  then  controlled  read 
concurrency  (Section  3.2).  The  work  of  the  Write- All  algorithm  described  in  Section  3.2  is 
Oi  .V  log  A”  log  P  +  P  log  P  log2  N  l  log  log  N),  where  1<P<N. 

In  Section  3.3  we  utilize  parallel  slackness  to  obtain  a  Write- All  algorithm  that  has  opti¬ 
mal  work  for  a  non-trivia]  range  of  processors.  The  algorithm  has  work  0(N  +  — ) 


and  is  optimal  for  flog2  P  <  A  log  log  A/ log2  A.  We  extend  this  algorithm  to  handle  ar¬ 
bitrary  initial  memory  contents  in  Section  3.4.  in  Section  3.5  we  demonstrate  that  there  is 
no  efficient  and  fault-tolerant  algorithm  whose  totaTwrite  concurrency  is  bounded  by  jF|£ 
for  0  <  e  <  1.  In  Section  3.6  we  describe  how  this  algorithm  can  be  used  as  a  building 
block  for  efficient  simulations  of  arbitrary  PRAMs.  This  is  a  dynamic  version  of  Brent’s 
lemma  [6]. 

In  Section  4  we  examine  the  worst-case  concurrency  per  step.  This  is  a  different  measure 
than  the  overall  concurrency  of  Section  3.  We  present  a  variation  of  algorithm  W  whose 
maximum  read/write  concurrency  per  step  is  Ar/polylog(A)  instead  of  A.  The  point  here 
is  decreasing  concurrency  per  step  by  polylog  amounts  using  a  general  pipelining  , technique 
of  synchronizing  processors  into  waves.  This  idea  can  be  combined  with  those  of  Section  3. 

The  case  of  static  faults  is  examined  in  Section  5.  Using  a  variant  of  the  parallel  prefix 
operation,  we  give  a  0(Ar  +  P'  log  P)-work  EREW  algorithm  for  Write- All,  where  P'  is 
the  number  of  live  processors  and  P  -  P'  is  the  number  of  initial  faults.  For  P  -  A,  this 
solution  is  optimal  (based  on  a  result  of  [5].  For  general  simulations  of  PRAM  algorithms 
our  solution  introduces  a  ©(P'log  P)  additive  overhead.  It  also  handles  static  memory 
faults  at'  no  additional  overhead.  In  summary,  the  static  case  has  a  simple  optimal  solution 
which  eliminates  memory  access  concurrency. 

We  conclude  with  some  open  problems  in  Section  6, 

2.  The  Model  and  Base  Algorithms 

2.1.  Fail-Stop  PRAM 

The  parallel  random  access  machine  (PRAM)  of  Fortune  and  Wyllie  [13]  combines  the 
simplicity  of  RAM  with  the  power  of  parallelism,  and  a  wealth  of  efficient  algorithms  exist  for 
it:  see  surveys  [12,  19]  for  the  rationale  behind  this  model  and  the  fundamental  algorithms. 
The  main  features  of  this  model  are  as  follows: 

1.  There  are  P  initial  processors  with  unique  identifiers  (PID)  in  the  range  1,...,P. 
Each  processor  knows  its  PID  and  the  number  of  processors  P. 

2.  The  memory  accessible  to  all  processors  is  denoted  as  shared ,  each  processor  also 
has  a  constant  size  local  memory  denoted  as  private.  Each  memory  cell  can  store 
0(logmax{Ar,  P})  bits  on  inputs  of  size  A. 

3.  The  input  is  stored  in  A  cells  in  shared  memory  and  the  rest  of  the  shared  memory 
is  cleared  (i.e.,  contains  zeroes).  The  processors  know  the  input  size  A. 

In  the  study  of  fail-stop  PRAMs,  this  base  model  is  extended  with  a  failure  model: 

4.  We  allow  any  dynamic  pattern  F  of  processor  fail-stop  errors  provided  one  processor 
survives  (one  processor  is  necessary  if  anything  is  to  be  done).  F  describes  which 
processors  fail  and  when.  This  pattern  is  determined  by  an  adversary,  who  knows 
everything  about  the  structure  and  the  dynamic  behavior  of  the  algorithm. 
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Fig.  1.  An  architecture  for  a  fail-stop  multiprocessor. 


5.  We  consider  fail-stop  (no  restart)  errors:  processors  fail  by  stopping  and  not  per¬ 
forming  any  further  actions;  their  private  memory  is  lost,  but  shared  memory  is  not 
affected  by  processor  failures. 

6.  Shared  memory  writes  are  atomic  with  respect  to  failures:  failures  can  occur  before  or 
after  a  shared  write  of  ©(log  max{  A,  P})-bit  words,  but  not  during  the  write.  (This 
non-triviaJ  assumption  is  made  only  for  simplicity  of  presentation;  algorithms  that 
make  it  can  be  converted  to  use  only  single  bit  atomic  writes  [18].)  We  also  handle 
arbitmry  initialization  of  the.  shared  memory ,  which  is  assumed  clear  in  the  fault-free 
PRAM  model.  (This  extends  3  above.) 

The  abstract  model  that  we  are  studying  can  be  realized  in  the  architecture  of  Fig.  1. 
Fault-tolerant  technologies,  such  as  those  in  surveys  [10,  15,  16],  all  contribute  towards 
concrete  realizations  of  its  components. 

a.  There  are  P  fail-stop  processors  (see  [29]),  each  with  a  unique  address  and  some  local 
memory. 

b.  There  are  Q  shared  memory  cells,  the  input  of  size  N  <  Q  is  stored  in  shared  memory. 
These  semiconductor  memories  can  be  manufactured  with  built-in  fault  tolerance  us¬ 
ing  replication  and  coding  techniques  without  appreciably  degrading  performance  [28], 

c.  Processors  and  memory  are  interconnected  via  a  synchronous  network  (e.g.,  as  in  the 
Ultracomputer  [30]).  A  combining  interconnection  network  that  is  well  suited  for  im¬ 
plementing  synchronous  concurrent  reads  and  writes  is  studied  in  [23]  (the  combining 
properties  are  used  in  their  simplest  form  only  to  implement  concurrent  access  to 
memory).  The  network  can  be  made  more  reliable  by  employing  redundancy  [1], 

2.2.  Measures  of  Efficiency 

The  complexity  measure  used  throughout  this  work  is  the  available  processor  steps 
of  [18].  It  generalizes  the  fault-free  Parallel-time  x  Processors  product  and  accounts  for 
all  steps  performed  by  the  active  processors. 

Definition  2.1.  Consider  a  computation  with  P  initial  processors  that  terminates  in  par¬ 
allel  time  t  after  completing  its  task  on  some  input  data  I  of  size  j/j  =  N,  and  in  the 
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presence  of  some  fail-stop  error  pattern  F  of  size  jFj  <  P.  If  Pi(I,F)  <  P  is  the  number 
of  processors  completing  an  instruction  at  time  i,  we  define  the  available  processor  steps  of 
this  computation  as 

*=i 

and  the  available  processor  steps  S  of  an  algorithm  as 

5  «  S(N,  P)  s  inax{$itFtp  :  j/j  =  N ,  |F{  <  P}. 

□ 

The  notion  of  algorithm  robustness  combines  fault  tolerance  and  efficiency: 

Definition  2.2.  Let  T(N)  be  the  best  sequential  (RAM)  time  bound  known  for  A-size 
instances  of  a  problem.  We  say  that  a  parallel  algorithm  for  this  problem  is  a  robust  parallel 
algorithm  if  for  any  input  1  of  size  |Jj  =  N  and  for  any  number  of  initial  processors  P 
(1  <  P  <  N)  and  for  any  failure  pattern  F  of  size  jFj  <  P,  this  algorithm  completes  its 
task  and  it  has  S  =  S(N,  P)  <  c  T(N)  log0  N,  for  fixed  c,  c'.  □ 

We  now  introduce  new  measures  to  gauge  the  concurrent  memory  accesses  of  a  compu¬ 
tation. 

Definition  2.3.  Consider  a  computation  with  P  initial  processors  that  terminates  in  par¬ 
allel  time  r  after  completing  its  task  on  some  input  data  I  of  size  j/|  =  N  in  the  presence 
of  fail-stop  error  pattern  F  of  size  jF|  <  P.  If  at  time  i  (1  <  i  <  r),  Pj*  processors  com¬ 
plete  reads  from  A’/*  shared  memory  locations  and  P™  processors  complete  writes  to  AT,W 
locations,  then  we  define: 

T 

(i)  the  read  concurrency  p  of  the  computation  a s:  p  =  Pi,f,P  =  ^  (jP^  ~  Ar,R 'j ,  and 

i=l 

T 

(ii)  the  write  concurrency  u  of  the  computation  as:  w  =  ^PtW  —  ^ .  □ 

Remark  1.  For  concurrent  reads  from  (writes  to)  a  particular  memory  location,  the  read 
(write)  concurrency  for  that  location  is  simply  the  number  of  readers  (writers)  minus  one. 
For  example,  if  only  one  processor  reads  from  (writes  to)  to  a  location,  then  the  concurrency 
is  0,  i.e.,  no  concurrency  is  involved.  This  leads  to  an  equivalent  definition  of  p  and  u:  For 
a  given  computation,  let  Rij  be  the  number  of  reads  from  location  i  during  step  j,  and 
H',j  be  the  number  of  writes  to  location  t  during  step  j.  Then  p  for  the  computation  is 
EJ=i  (R,j-l),andu>is  £J=i  It  is  easy  to  see  that  this  definition 

of  p  and  w’  is  equivalent  to  Definition  2.3  above. 

Remark  2.  Note  that  p  and  w  are  defined  for  a  computation,  or  actual  run,  of  an  algorithm. 
They  are  cumulative  measures  over  all  the  steps  of  the  computation.  S  is  the  maximum 
ovct  all  computations  where  |/|  —  N  and  jFj  <  P .  Our  bounds  for  S  are  worst-case  over 
all  inputs  and  failure  patterns,  but  our  bounds  for  p  and  w  depend  on  (F|  and  |/|  =  N . 
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Oi  forall  processors  PJD=1.,1V  par  begin 

02  Phase  VV3:  Visit  the  leaves  based  oa  P1D  to  perform  work  on  the  input  data 
03  Phase  W4:  Traverse  the  progress  tree  bottom  up  to  measure  progress 
04  while  the  root  of  the  progress  tree  is  aoi  N  do 

05  Phase  Wl:  Traverse  the  processor  enumeration  tree  bottom  up  to  enumerate  processors 

06  Phase  W2:  Traverse  the  progress  trev  top  down  to  reschedule  work 

0?  Phase  W3:  Perform  rescheduled  work  on  the  input  data 

08  Phase  W4:  Traverse  the  progress  tree  bottom  up  to  measure  progress 

09  t>d 

10  pareud 


Fig.  2.  A  high  level  view  of  algorithm  W  for  P  —  N . 


Remark  3.  p  and  u  can  be  used  to  characterize  the  type  of  PRAM  that  an  execution  of 
an  algorithm  requires.  An  execution  that  has  p  =  u)  =  0  can  be  carried  out  on  an  EREW 
PRAM  while  a  CREW  PRAM  suffices  for  an  execution  with  a?  ss  0.  This  does  not  imply 
that  the  algorithm  itself  is  an  EREW  or  CREW  algorithm.  However,  an  algorithm  is  EREW 
if  its  worst  case,  p  and  j,’  are  0  and  it  is  CREW  if  its  worst  case  is  0. 

2.3.  Background  Algorithms 
2.3.1.  Algorithm  W 

Algorithm  W  of  [18]  is  a  robust  Write- All  solution.  It  uses  two  complete  binary  trees 
as  data  structures:  the  processor  enumeration  and  the  progress  trees.  A  high  level  view 
of  algorithm  W  is  in  Fig.  2.  It  is  an  iterative  algorithm  in  which  aU  active  processors 
synchronously  execute  the  following  four  phases: 

•  Phase  Wl  — processor  enumeration.  All  processors  traverse  bottom-up  the  processor 
enumeration  tree  always  starting  at  the  same  leaf,  i.e,  this  traversal  is  static.  This 
phase  utilizes  a  version  of  the  standard  parallel  prefix  algorithm.  The  result  is  an 
overestimate  of  the  number  of  live  processors  in  every  subtree  and  a  numbering  of  the 
live  processors. 

•  Phase  W2  —  processor  allocation.  The  processors  traverse  top-down  the  progress  tree 
using  a  divide-and-conquer  approach  (based  on  processor  enumeration  and  progress 
measurement)  to  allocate  themselves  to  the  unvisited  leaves  of  the  progress  tree. 

•  Phase  W3  —  work  phase.  The  processors  work  at  the  leaves  they  reached  in  W2, 

•  Phase  W4  —  progress  measurement.  The  processors  traverse  bottom-up  the  progress 
tree  and  compute  an  underestimate  of  the  progress  of  the  algorithm  for  each  subtree. 
They  start  from  the  leaves  they  were  at  in  phase  W3.  A  version  of  the  standard 
parallel  summation  algorithm  is  employed  to  compute  underestimates  of  the  progress 
for  each  subtree. 
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01  forall  processors  PID=1..P  p&rbegic - P  fail-stop  processors  simulate  N  fault-prone  ones 

02  The  PRAM  program  for  N  processors  is  stored  in  shared  memory  (read-only) 

03  Shared  memory  has  ’two  generations:  current  and  future;  the  rest  of  shared  memory  is  clear 
04  initialise  N  simulated  instruction  counters  to  start  at  the  first  instruction 
OS  while  there  is  a  simulated  processor  that  has  not  halted  do 

06  — Perform  tentative  computation:  Fetch  next  instruction;  Copy  registers  to  scratchpad; 

07  Do  read  cycle  using  “current”  memory;  Perform  the  compute  cycle; 

08  Do  write  cycle  into  the  “future”  memory;  Compute  next  instruction  address 
09  — Reconcile  memory  and  registers.  Copy  “future"  memory  and  registers  to  “current* 

10  od 

11  parewd 

Fig.  3.  Simulations  using  Write- All  primitive  and  two  generations  of  shared  memory. 

The  following  lemma  of  [24]  provides  a  bound  on  the  total  number  of  block  steps  executed 
by  all  processors  (where  a  block  step  is  an  execution  by  one  processor  of  the  body  of  the 
while-loop  in  Fig.  2).  It  will  be  used  in  our  analyses  in  Section  3. 

Lemma  2.1  ([24]).  Let  U  be  the  number  of  unvisited  leaves  of  the  progress  tree  and  P  be 
the  number  of  processors  at  the  beginning  of  algorithm  W.  Let  Ut  and  Pi  be  the  number 
of  unvisited  leaves  of  the  progress  tree  and  processors,  respectively,  at  the  start  of  the  ith 
iteration  of  the  while  loop.  Then  the  total  number  of  block  steps  of  all  iterations  for  which 
Pi  <  V;  is  at  most  U  and  the  total  number  of  block  steps  of  iterations  for  which  Pi  >  {/,•  is 
at  most  OiP'fcgft—i j).  The  overall  number  of  block  steps  is  0(U  +  P  Sgftgu)-  a 

Algorithm  W  can  be  parameterized  by  clustering  elements  of  the  input  array  and  as¬ 
signing  one  cluster  to  each  leaf  of  the  progress  tree.  We  will  use  the  notation  W[s]  to  specify 
that  the  input  elements  form  clusters  of  size  s;  in  this  case  the  progrees  tree  has  N/s  leaves, 
one  for  each  cluster.  When  using  P  processors  such  that  P  >  j,  it  is  sufficient  for  each 
processor  to  take  its  PID  modulo  y  to  assure  a  uniform  initial  assignment  of  at  least  [P/~- j 
and  no  more  than  [P/~]  processors  to  a  work  element. 

Algorithm  W  performs  best  when  the  cluster  size  s  is  chosen  to  be  log  JV,  resulting  in 
a  progress  tree  of  X  /  log  N  leaves.  A  complete  description  of  the  algorithm  can  be  found 
in  [18].  Using  the  above  lemma  for  U  —  N /log N ,  a  tight  bound  can  be  obtained  for 
algorithm  W[log  Arj: 

Theorem  2.2  ([18,  24]).  Algorithm  W[log  Ar]  is  a  robust  parallel  algorithm  for  the  Write- 
All  problem  with  S  =  0(N  +  P  log2  A/  1c  g  log  A: ) ,  where  X  is  the  input  array  size  and  the 
initial  number  of  processors  P  is  between  1  and  N .  □ 


2. S. 2.  PRAM  Simulations 

The  original  motivation  for  studying  the  Write-All  problem  was  that  it  intuitively  cap¬ 
tures  the  essential  nature  of  a  single  synchronous  PRAM  step.  This  intuition  was  made 
concrete  when  it  was  shown  in  [20.  31]  how  to  use  solutions  for  the  Write- All  problem  in 
implementing  general  PRAM  simulations. 
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01  for  all  processors  PID=1  ..P  parbegia  —  Use  P  processors  tc  clear  N  memory  locations 
02  Ctear  the  initial  block  of  N&  —  G&  elements  sequentially  using  P  piwtssors 
03  i  0  —Iteration  counter 
04  while  Ar,  <  N  do 

OS  Use  a  Write- All  solution  with  data  structures  at  size  N,  and  G,+i  elements 

0$  at  the  leaves  to  clear  memory  of  size  jV1+j  =  Ni  ■  Gi-,  i;  * *  -f  1 

0?  od 
OS  pareiid 

Fig.  4.  A  high  level  view  of  algorithm  Z. 

A  high  level  view  of  this  approach  is  given  in  Fig.  3.  The  simulations  are  implemented 
by  robustly  executing  each  of  the  cycles  of  the  PRAM  step:  instruction  fetch,  read  cycle, 
compute  cycle,  write  cycle,  next  instruction  address  computation.  This  is  done  using  two 
generations  of  shared  memory,  “current”  and  “future”,  and  by  executing  each  of  these  cycles 
in  the  Write- All  style,  e.g,,  using  algorithm  W  (for  details,  see  {20,  31)). 

Using  such  algorithm  simulation  techniques  it  was  shown  in  [20,  31]  that  if  SW{N,  P) 
denotes  the  available  steps  of  solving  a  Write- AG  instance  of  sire  N  using  P  processors,  and 
if  a  linear  in  N  amount  of  clear  memory  is  available,  then  any  A? -processor  PRAM  step 
can  be  deterministically  simulated  using  P  fail-stop  processors  and  work  SW(N,  P).  If  the 
Parallel-time  x Processors  of  an  original  Ar-processor  algorithm  is  r  •  N,  then  the  available 
steps  of  the  fault-tolerant  simulation  will  be  0(r  ■  SW(N,  P)). 

2.3. 3.  Static  Initial  Memory  Errors 

The  Write- AU  algorithms  and  simulations,  e.g.,  [18,  20,  22,  31],  or  the  algorithms  that 
can  serve  as  Write- AU  solutions,  e.g.,  the  algorithms  in  {8,  26],  invariably  assume  that  a 
linear  portion  of  shared  memory  is  either  cleared  or  is  initialized  to  known  values.  Starting 
with  a  non-contaminated  portion  of  memory,  these  algorithms  perform  their  computation 
by  “using  up”  the  clear  memory,  and  concurrently  or  subsequently  clearing  segments  of 
memory  needed  for  future  iterations.  An  efficient  Write- All  solution  that  requires  no  clear 
shared  memory  has  recently  been  defined  using  a  bootstrap  approach  [32]. 

The  bootstrapping  proceeds  in  stages:  In  stage  1  all  P  processors  dear  an  initial  segment 
of  A'o  locations  in  the  auxiliary  memory.  In  stage  i  the  P  processors  dear  Ar,+1  —  A",  •  G t+1 
memory  locations  using  A,  memory  locations  that  were  deared  in  stage  t  —  1.  If  A,+1  >  Nt 
and  .Vo  >  1,  then  the  required  N  memory  location  will  be  deared  in  at  most  N  stages. 

The  efficiency  of  the  resulting  algorithm  depends  on  the  particular  Write- AG  solution(s) 
used  and  the  parameters  Ni  and  G(.  In  [32],  a  solution  for  Write- AU  (algorithm  Z,  see  Fig.  4) 
is  presented  that  for  any  failure  pattern  F  (|Fj  <  P)  has  work  0(N  +  Plog3  A7/(logIog  N )2) 
without  any  initialization  assumptions. 

This  solution  uses  algorithm  W[iog  AT]  as  the  underlying  Write- AG  algorithm.  In  the 
early  iterations,  the  number  P  of  processors  may  be  greater  than  the  number  JV,  of  leaves 
of  the  enumeration  tree.  In  this  case  processors  are  assigned  to  the  leaves  by  taking  their 
PIDs  mod  N{,  as  described  in  Section  2.3.1. 
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3.  Controlling  Memory  Access  Concurrency 

Algorithm  W  is  a  robust  fail-stop  Write- All  solution,  however  it  makes  liberal  use  of 
concurrent  memory  accesses  ewn  token  there  are  no  faults.  When  P  —  N.  in  phase  W4 
(Fig.  2,  Hue  03),  the  algorithm  has  «  =s  [n  -  —  fi(Aflog¥)  in  the  absence 

of  failures.  The  same  problem  is  shared  by  all  robust  deterministic  Write- AH  algorithms 
that  have  been  proposed  to  date. 

As  shown  in  [IS],  concurrent  memory  writes  cannot  be  eliminated  altogether.  Any 
Write- AH  algorithm  that  does  not  use  concurrent  writes.  e.g,  CREW  PRAM,  has  worst 
case  work  fl{.¥2)  and  thus  is  not  robust. 

3.1.  Algorithm  W  mth  Controlled  Write  Concurrency 

In  this  section  we  present  a  version  of  algorithm  W  that  controls  the  overall  number  of 
concurrent  accesses.  For  ease  of  presentation  we  wiB  first  develop  methods  to  control  the 
write  concurrency  while  allowing  unlimited  concurrent  reads.  In  the  next  section  we  wiB 
extend  these  techniques  to  control  the  read  concurrency  as  well. 

Assume  there  are  p  <  P  processors  that  want  to  write  to  a  common  memory  location  T. 
Our  technique  for  controlling  the  write  concurrency  is  based  on  organizing  the  p  processors 
into  a  processor  priority  tree  (PPT). 

A  PPT  is  a  binary  tree  whose  nodes  are  associated  with  processors  based  on  a  processor 
numbering.  All  levels  of  the  tree  but  the  last  are  full  and  the  leaves  of  the  last  level  are 
packed  as  left  as  possible.  Thus  these  trees  are  structurally  similar  to  heaps.  PPT  nodes 
are  numbered  from  1  to  p  in  a  breadth-first  left-to-right  fashion  where  the  parent  of  the 
ith  node  has  index  |i/2j  and  its  left  /right  children  have  indices  2i  and  2i  +  1,  respectively. 
Processors  are  also  numbered  from  1  to  p  and  the  ith  processor  is  associated  with  the  ith 
node.  Note  that  these  processor  numbers  need  not  be  the  same  as  their  permanent  PIDs. 
Priorities  are  assigned  to  the  processors  based  on  the  level  they  are  at.  The  root  has  the 
highest  priority  and  priorities  decrease  according  to  the  distance  from  the  root.  Processors 
at  the  same  level  of  the  tree  have  the  same  priority,  which  is  lower  than  that  of  their  parents. 

Processor  priorities  are  used  to  determine  when  a  processor  can  write  to  the  common 
memory  location  T.  All  the  processors  with  the  same  priority  attempt  to  write  to  T 
concurrently  but  only  if  higher  priority  processors  have  failed  to  do  so.  To  accomplish 
this  the  processors  of  a  PPT  concurrently  execute  algorithm  CW,  shown  in  Fig.  5.  Starting 
at  the  root  (highest  priority)  and  -joi-tg  down  the  tree  one  level  at  a  time,  the  processors 
at  each  level  first  read  T  and  then  concurrently  update  it  iff  it  contains  an  old  value.  This 
implies  that  if  level  i  processors  effect  the  writ®,  all  processors  at  higher  levels  must  have 
failed. 

It  is  clear  that  whereas  unrestricted  concurrent  writes  by  p<  P  processors  take  only  one 
step  to  update  T,  a  PPT  requires  as  many  steps  as  there  are  levels  in  the  tree,  i.e.,  O(logp). 
Provided  there  is  at  least  one  live  processor  in  the  tree  the  correct  value  will  eventually  be 
written.  Although  CW  introduces  several  idle  processor  cycles,  it  allows  us  to  bound  the 
number  of  concurrent  writes  incurred  during  an  update  of  T. 
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01  fbrali  processors  ID  at  j..j>  p*r  begin 

02  — Freer  -■»<  rs  write  a  ricui  value  to  location  T  at  follows 

03  for  i  k  i'  ..  [log  pj  do  — For  each  level  i  of  processor  priority  tree 

04  if  T  <  ID  <  2*41  then  —  TTie  processors  at  level  i 

OS  READ  location  T 

00  if  T  does  not  contain  sew  value  then 

0?  WRITE  lit*  value  to  T 

00  8ft 

90  od 

10  parend 


Fig.  5.  Pseudocode  for  the  controlled  write  (CW)  algorithm. 


Lemma  3.1.  If  algorithm  CW  is  executed  in  the  presence  of  a  failure  pattern  F,  then  its 
write  concurrency  u?  is  no  more  than  the  number  of  failures  |F|. 

Proof.  By  the  description  of  CW  above  it  is  dear  that  location  T  is  updated  by  the 
processors  of  at  most  one  level,  say  i.  In  this  case  ail  processors  oa  levels  G, . . . ,  i  ~  1  must 
have  failed.  Since  level  j  has  2J  processors  the  total  number  of  processors  o®  these  levels 
is  q  2J  =  2‘  —  1  and  hence  }Fj  >  2*  —  1.  Oa  the  other  hand,  level  i  has  at  most  2'  live 
processors  and  hence  tv  <  2'  -  1.  The  lemma  follows  by  combining  the  two  inequalities.  O 

We  can  now  modify  algorithm  W  by  incorporating  PPTs  into  its  four  phases.  This 
is  done  by  replacing  each  concurrent  write  in  these  phases  by  an  execution  of  CW.  Since 
in  the  worst  case  all  P  processors  can  form  a  single  PPT,  we  allow  [log  P\  -f  1  steps  for 
each  execution  of  CW,  thus  slowing  down  each  iteration  of  the  original  algorithm  W  by  a 
O(IogF)  factor.  (Although  this  will  he  sufficient  for  our  analysis,  in  practice  we  could  use 
the  overestimate  of  surviving  processors  computed  in  phase  Wl  to  find  the  height  of  the 
largest  possible  PPT  for  the  current  iteration  and  use  this  to  determine  how  much  time  to 
spend  in  CW.)  PPTs  are  organized  and  maintained  as  follows. 

At  the  outset  of  the  algorithm  and  at  the  beginning  of  each  execution  of  phase  Wl 
each  processor  forms  a  PPT  by  itself.  As  processors  traverse  bottom  up  the  trees  used  by 
algorithm  W  they  may  encounter  processors  coming  from  the  sibling  node  (if  there  is  one). 
In  this  case  the  two  PPTs  are  merged  to  produce  a  larger  PPT  at  the  common  parent  of  the 
two  nodes.  In  order  to  be  able  to  use  Lemma  3.1  to  bound  the  overall  write  concurrency 
of  the  algorithm,  we  roust  guarantee  that  a  newly  constructed  PPT  has  no  processors  that 
have  already  been  accounted  for  as  dead  In  the  above  lemma.  Such  processors  ne  those 
that  are  above  the  PPT  level  that  effected  the  write.  In  order  to  accomplish  this  we  compact 
PPTs  during  merging  to  eliminate  these  processors. 

To  compact  and  merge  two  PPTs,  the  processors  of  each  PPT  store  in  the  memory 
location  they  are  to  update  the  size  of  the  PPT,  the  index  of  the  level  that  effected  the 
write  and  a  timestamp  along  with  the  new  value  to  be  stored  in  this  location.  A  timestamp 
in  this  context  is  just  a  sequence  number  that  need  not  exceed  N  (see  [18]).  Since  there 
can  be  at  most  P  processors,  0(log  A)  +  O(logA')  +  O(logP)  +  O(loglogP)  =  O(logA') 
bits  of  information  need  to  be  stored  for  the  new  value,  the  timestamp,  the  size  of  the  tree 


Fig.  6.  Merging  and  compacting  two  PPTs.  There  are  PL  and  Pn  processors  in  the  left 
and  right  PPTs.  respectively.  At  the  end  of  algorithm  CW,  up  to  P[  and  processors 
remained  in  the  PPTs.  The  malting  PPT  will  contain  the  surviving  P£  4-  processors. 


and  the  level  index. 

After  the  log  P  steps  of  CW,  the  processors  of  each  of  the  two  PPTs  that  are  to  be  merged 
concurrently  read  the  information  stored  in  the  two  memory  locations  they  updated  and 
compute  the  size  of  the  resulting  PPT,  which  is  the  sum  of  the  two  sizes  read  less  the  number 
of  the  certifiable  dead  processors  above  the  levels  that  effected  the  writes.  By  convention, 
when  two  PPTs  are  merged  the  processors  of  the  left  PPT  are  added  to  the  bottom  of  the 
right  one.  The  timestamp  is  needed  so  that  processors  can  check  whether  the  values  they 
read  are  current  (i.e.,  there  is  some  PPT  at  the  sibling  node  and  merging  has  to  take  place) 
or  old. 

Using  the  size  and  the  level  index  each  processor  can  locally  compute  its  position  in  the 
merged  tree.  If  Si  and  s2  are  the  sizes  of  the  left  and  right  PPTs  respectively,  and  l\  and 
l2  are  the  corresponding  level  indices,  the  merged  tree  will  have  size  sj  4  ij  -  2‘‘  —  2,a  4-  2 
and  the  first  s2  —  2*s  4- 1  nodes  will  be  occupied  by  the  processors  from  the  right  PPT.  This 
process  is  illustrated  in  Fig.  6. 

We  refer  to  the  algorithm  that  results  from  W  by  incorporating  CW  and  PPT  com¬ 
paction  and  merging  as  algorithm  Wcw *  In  more  detail,  the  four  phases  of  algorithm  W 
need  to  be  modified  as  follows: 

Wl:  At  the  beginning  of  this  phase  each  processor  forms  a  PPT  by  itself.  Processors 
traverse  the  enumeration  tree  bottom  up  storing  at  each  internal  node  of  the  tree 
a  value  based  on  the  values  stored  at  its  two  children.  In  addition,  in  the  original 
algorithm  \V  a  timestamp  is  written  to  distinguish  new  from  old  values.  The  required 
modifications  entail  using  algorithm  CW  for  writing  to  the  nodes  of  the  enumeration 
tree  and  dispensing  with  the  timestamp;  the  latter  is  because  all  values  written  by 
CW  are  already  timestamped  in  order  to  facilitate  PPT  merging  as  discussed  above. 
After  each  call  of  CW  all  PPTs  are  compacted  and  merged. 

\Y2:  This  phase  can  be  implemented  without  concurrent  writes  and  the  only  modification 
is  that  processors  need  to  keep  track  of  the  size  of  the  PPT  they  belong  to  and  their 
position  within  this  PPT  since  this  information  is  needed  in  the  subsequent  phases. 
As  processors  traverse  the  progress  tree  top-down  they  move  to  the  left  or  right  child 
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of  the  current  node  in  proportion  to  the  number  of  unvisited  leaves  in  the  left  and 
right  subtrees,  as  in  the  original  algorithm  W.  This  causes  each  PPT  to  be  split  in  two 
at  each  step.  If  a  PPT  has  k  processors  of  which  k'  need  to  go  left  and  the  remaining 
k  -  k‘  need  to  go  right,  then  by  convention  the  first  k'  processors  of  the  PPT  are 
allocated  to  the  PPT  that  moves  to  the  left  child  and  the  remaining  k  —  k*  processors 
form  the  PPT  that  moves  to  the  right  child. 

W3:  Processors  use  the  information  they  gathered  during  phase  W2  to  organize  themselves 
into  PPTs  and  then  proceed  to  write  1  to  the  input  elements  associated  with  the  leaf 
they  reached  at  the  end  of  W2.  Each  such  write  uses  algorithm  CW  and  is  followed 
by  compaction  to  eliminate  the  dead  processors.  There  is  no  PPT  merging  in  this 
phase.  As  a  result  there  is  no  need  for  timestamping  either. 

W4:  In  this  phase  processors  traverse  the  progress  tree  bottom  up  writing  at  each  node 
the  sum  of  the  values  stored  at  its  two  children.  This  write  is  accomplished  by  calling 
CW  and  is  followed  by  compaction  and  merging.  The  PPTs  used  at  the  beginning  of 
this  phase  are  those  left  at  the  end  of  W3. 

For  the  case  where  each  leaf  of  the  progress  tree  is  associated  with  one  element  of  the 
input  array  we  can  obtain  the  following  bound  on  the  performance  of  Wcw : 

Lemma  3.2.  Algorithm  Wow  ts  a  robust  algorithm  for  the  Write-All  problem  with  ui  <  |F| 
and  work  S  -  0(N  log  N  log  P  +  P  log  P  log2  Nf  log  log  N),  where  1  <  P  <  N. 

Proof.  The  correctness  of  the  algorithm  follows  from  the  correctness  of  algorithms  W  and 
CW  and  from  the  fact  that  we  just  replace  each  write  step  in  W  by  a  call  to  CW. 

To  show  the  bound  on  S  we  invoke  Lemma  2.1  according  to  which  the  total  number  of 
block  steps  of  algorithm  W  is  at  most  U  +  P  log  U  j  log  log  f/,  where  U  is  the  initial  number  of 
unvisited  leaves.  The  number  of  available  processor  steps  is  (U  +  P  log  V/  log  log  U  )(i<  + 1[). 
where  tt  is  the  time  needed  to  traverse  the  processor  enumeration  and  progress  trees  in 
phases  Wl.  W2.  and  W4,  and  is  the  time  spent  at  the  leaves  during  phase  W3. 

In  our  case.  tt  —  0( log  N  log  P)  since  each  execution  of  CW  takes  0(log  P )  time  and  the 
height  of  the  trees  traversed  is  O(logfV),  and  tt  =  O(logP),  the  time  of  a  single  execution 
of  C'W.  Thus,  with  U  =  N 

S  =  (N  +  P  log  Nf  log  log  N  )(0( log  N  log  P)  +  0(log  P)) 

=  0(N  log  N  log  P+P  log  Flog2  Nf  log  log  JV). 

To  show  the  bound  on  we  first  note  that  contributions  to  u  occur  only  during  the 
executions  of  CW  since  there  are  no  other  concurrent  writes.  If  there  were  m  executions  of 
CW.  let  a.’,  denote  the  contribution  of  the  ith  such  execution  to  u,  1  <  i  <  m.  Let  F,  C  F 
be  the  failure  pattern  that  consists  of  the  failures  of  the  processors  above  the  PPT  level  that 
effected  the  write  or  of  ail  the  processors  of  the  PPT  if  all  failed  during  the  ith  execution. 
From  the  proof  of  Lemma  3.1,  ay  <  |F,j.  Due  to  the  compaction  of  PPTs  all  processors 
named  in  F,  are  eliminated  at  the  end  of  CW  and  hence  can  cause  at  most  one  concurrent 
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wri.e  (contribute  at  most  once  to  w).  By  summing  the  contributions  from  all  executions  of 
CW  we  conclude  that  the  overall  write  concurrency  is  u>  = 

Note  that  failed  processors  that  are  below  the  level  that  completed  the  write  during  CW 
are  not  removed  by  compaction  but  they  do  not  cause  any  concurrent  writes.  □ 

3.2.  Algorithm  W  with  Controlled  Read/Write  Concurrency 

In  this  subsection  we  present  extensions  of  algorithm  CW  that  control  read  concurrency 
in  addition  to  write  concurrency.  Our  approach  is  based  on  replacing  concurrent  reads  by- 
broadcasts  through  the  PPTs.  We  first  describe  a  simple  scheme  and  then  give  a  more 
elaborate  and  efficient  method. 

As  before  assume  there  are  p  <  P  processors  that  need  to  write  to  a  common  memory 
location  T.  Algorithm  CW  allows  all  the  processors  of  a  PPT  level  to  concurrently  read  T 
to  check  if  it  has  been  updated.  Our  first  scheme  for  controlling  read  concurrency  replaces 
this  concurrent  read  by  a  broadcast  through  the  processors  of  the  PPT  level.  Specifically, 
we  view  the  processors  of  each  level  as  forming  a  heap-like  binary  tree  where  the  leftmost 
processor  is  the  root,  the  second  and  third  processors  are  the  root’s  children  and  so  on. 

The  broadcast  through  each  such  tree  proceeds  in  a  top-down  fashion  by  having  each 
processor  broadcast  the  value  it  read  to  its  two  children,  starting  with  the  root.  If  a 
processor  does  not  receive  a  value  because  its  parent  has  failed,  then  it  reads  T  directly.  If 
the  sibling  processor  is  alive  it  will  also  read  T  thus  incurring  a  concurrent  read.  The  rest  of 
the  algorithm  is  as  in  CW.  The  tree  formed  by  the  processors  of  the  tth  PPT  level  has  i  -f  1 
levels  of  its  own  so  t  4- 1  steps  are  required  for  the  ith  PPT  level  (i  <  flogp'j).  This  means 
that  algorithm  CW  modified  as  described  here  requires  max{i+  l}logP  =  0(log2  P)  steps. 

Simple  as  this  method  may  be  it  is  not  satisfactory  as  it  does  not  provide  good  bounds 
on  the  read  concurrency.  The  problem  stems  from  the  fact  that  a  failed  processor  may 
remain  in  a  PPT  for  several  steps  and  cause  concurrent  reads  during  each  of  these  steps. 

We  next  present  a  more  elaborate  scheme  that  addresses  this  problem.  The  advantages 
of  this  algorithm  over  the  on<"  described  above  are 

•  it  uses  several  PPT  levels  at  a  time  to  broadcast  instead  of  just  one,  and 

•  it  tries  to  move  failed  processors  towards  the  leaves  of  the  PPT. 

Even  though  a  failed  processor  can  still  remain  in  a  PPT  for  several  steps,  its  movement 
towards  the  leaves  means  that  it  can  cause  concurrent  reads  foT  only  a  small  number  of 
steps,  as  we  will  show  below.  This  algorithm  will  allow  us  to  control  read  concurrency 
without  degrading  the  performance  of  algorithm  Wcw-  It  relies  on  the  observation  that 
each  level  of  a  PPT  contains  one  more  processor  than  all  the  levels  above  it.  This  allows  us 
to  use  levels  0, . . . ,  i  —  1  to  broadcast  to  level  t  in  constant  time. 

Algorithm  CR/W:  A  high  level  view  of  this  algorithm,  that  we  call  CR/W,  is  given  in 
Fig.  7.  Communication  takes  place  through  a  shared  memory  array  ( B  in  Fig.  7).  Such  an 
array  is  used  by  each  PPT  so  that  processors  can  communicate  with  each  other  based  on 
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OJ  — Code  executed  by  each  of  the  p  processor t  in  the  PPT 
02  Initialize  B\I  D]  ——broadcast  location  for  processor  ID 
03  for  i  at  0 . , ,  [log  pj  do  — for  each  level  i  of  the  PPT 
04  if  2*  <  ID  <  2,+1  then  —if  the  processor  it  at  level  i 
05  READ  B[/D] 

06  if  B[JD]  has  not  been  updated  then  — note  that  this  is  always  true  if  ID  st  2,+1  —  1 
07  READ  (value,  level)  from  T 
08  if  T  contains  some  old  value  then 
09  level  :=  i  ——level  i  effects  the  write 

10  WRITE  (newvalue,  level)  to  T 

11  else  if  2Uvtl  <  ID  —  2'  +  1  <  2'  then  — unless  it  is  above  the  level  that  effects  the  write 

12  ID  :=  IjD  —  2'  +  1  —  toJbe  the  place  of  the  processor  that  was  to  update  B[ID] 

13  ft  fill  ft 

14  if  ID  <  2,+J  and  ID  +  2,+1  <  p  then  — if  the  processor’s  level  s  <  i 

15  WRITE  iteintialtie  to  B[/Z)  +  2‘+,3  — broadcast  to  appropriate  processor  at  level  *  +  ] 

IS  ft 

17  od  _ _ 

Fig.  7.  High  level  view  of  algorithm  CR/W  for  the  controlled  read/write.  This  code  is 
executed  by  each  of  the  p  <  P  processors  in  the  PPT.  Here  ID  is  the  position  of  the 
processor  in  the  PPT  and  newvalue  is  the  value  to  be  written.  Processors  read  and  write 
pairs  that  include  T's  value  as  well  as  the  index  of  the  PPT  level  that  effects  the  write. 


their  positions  in  the  PPT;  B[k]  stores  values  read  by  the  kth  processor  of  the  PPT.  This 
allows  processors  to  communicate  without  requiring  them  to  know  each  other’s  identity. 
For  all  the  PPTs,  we  can  store  these  B  arrays  in  a  segment  of  memory  of  size  P;  we  discuss 
an  implementation  later  on  in  this  section. 

Each  processor  on  levels  0,  1  is  associated  with  exactly  one  processor  on  each  of 

levels  i. !  4- 1 .  —  Specifically,  the  jfth  processor  of  the  PPT  is  responsible  for  broadcasting  to 
the  jth  (in  a  left-to-right  numbering)  processor  of  each  level  below  its  own.  The  algorithm 
proceeds  in  [logpj  +  1  iterations  that  correspond  to  the  PPT  levels.  At  iteration  t,  each 
processor  of  level  i  reads  its  B  location  (line  5).  If  this  location  has  not  been  updated,  then 
the  processor  reads  T  directly  (lines  6-7). 

Since  a  PPT  level  has  potentially  more  live  processors  than  all  the  levels  above  it  com¬ 
bined.  there  is  in  general  at  least  one  processor  on  each  level  that  reads  T  since  no  processor 
at  a  higher  level  is  assigned  to  it.  If  a  level  is  full,  this  processor  is  the  rightmost  one  (the 
root  of  the  PPT  for  level  0).  As  long  as  there  are  no  failures  this  is  the  only  direct  access 
to  T.  Concurrent  accesses  can  occur  only  in  the  presence  of  failures.  In  such  a  case  several 
processors  on  the  same  level  may  fail  to  receive  values  from  processors  at  higher  levels,  in 
which  case  they  will  access  T  directly  incurring  concurrent  reads. 

A  processor  that  needs  to  read  T  directly  first  checks  whether  it  contains  the  value  to 
be  written  (line  8)  and  then  writes  to  it  (line  10)  if  it  does  not.  As  can  be  seen  from  the 
figure,  whenever  processors  update  T  they  write  not  only  the  new  value  for  T  but  also  the 
index  of  the  level  that  effected  the  write  (for  simplicity  we  leave  out  the  timestamps  and 
the  size  of  the  tree). 
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01  ~~-Code  executed  by  each  of  the  p  processors  in  the  PPT 
02  Initialize  B[ID]  — broadcast  location  for  processor  ID 
03  for  i  =  0 . . .  [log  pj  do  — for  each  level  i  of  the  PPT 
04  if  2*  <  ID  <  2*+1  then  — —if  the  processor  is  at  level  * 

OS  READ  B[ID) 

06  if  B{ID]  has  not  been  updated  then  — note  that  this  is  always  true  if  ID  —  2<+1  —  1 
07  READ  T 

08  if  ID  yl  2,+1  —  1  then  — unless  this  is  the  rightmost  processor 
09  ID  :=  ID  —  2’  +  1  —take  the  place  of  the  processor  that  was  to  update  B[ID) 

10  fl  fl-fi 

H  if  ID  <  2,+1  and  ID  +  2,+1  <  p  then  — if  the  processor’s  level  is  <  * 

12  WRITE  value  read  to  B[ID  +  2,+l]  — broadcast  to  appropriate  processor  at  level  i  + 1 

13  fi 

14  od 

Fig.  8.  High  level  view  of  algorithm  CR1  for  the  controlled  read.  The  code  is  executed  by 
each  of  the  p  <  P  processors  in  the  PPT.  Here  ID  is  the  position  of  the  processor  in  the 
PPT,  which  may  change  during  the  execution  of  the  algorithm. 

If  a  processor  P*  that  accesses  T  directly  verifies  that  T  has  the  correct  value  and  the 
failed  processor  that  should  have  broadcast  to  P*  is  at  or  below  the  level  that  effected  the 
write,  then  P*  assumes  the  position  of  the  failed  processor  in  the  PPT  (lines  1  i— 12).  This 
effectively  moves  failed  processors  towards  the  leaves  of  the  PPT  and  plays  an  important 
role  in  establishing  the  bound  on  the  read  concurrency.  Failed  processors  are  moved  towards 
the  leaves  only  if  they  are  not  above  the  level  that  effects  the  write  since  processors  above 
this  level  will  be  eliminated  by  compaction  as  was  done  in  the  previous  section. 

Algorithms  CRl  and  CR2:  Algorithm  CR/W  combines  a  read  with  a  write  (i.e.,  it 
incorporates  the  functionality  of  the  old  CW).  In  addition  to  CR/W  we  use  two  simpler 
algorithms  when  all  the  processors  of  a  PPT  need  to  read  a  common  memory  location  but 
no  write  is  involved. 

The  first  of  them  is  similar  to  CR/W  but  without  the  write  step.  This  algorithm, 
referred  to  as  CRl,  is  presented  in  Fig  8.  It  is  simpler  than  CR/W  since  processors  that  are 
found  to  have  failed  are  pushed  towards  the  bottom  of  the  PPT  independent  of  the  level 
they  are  at.  This  algorithm  will  be  used  for  bottom-up  traversals  during  W. 

The  second  algorithm,  referred  to  as  CR2,  uses  a  simple  top-down  broadcast  through 
the  PPT.  Starting  with  the  root  each  processor  broadcasts  to  its  two  children;  if  a  processor 
fails  then  its  two  children  read  T  directly.  Thus  the  processors  of  level  t  are  responsible  to 
broadcast  only  to  processors  of  level  *  -f  1.  Unlike  CRl  no  processor  movement  takes  place. 
This  algorithm  will  be  used  for  top-down  traversals  during  W. 

It  is  easy  to  see  from  the  description  of  the  algorithms  that  all  three  of  CR/W,  CRl, 
and  CR2  require  O(log  P)  time  for  PPTs  of  at  most  P  processors. 

Algorithm  W cr/w:  We  are  now  ready  to  present  a  Write- All  algorithm  that  controls 
both  read  and  writ**  concurrency.  It  utilizes  algorithms  CR/W.  CRl,  and  CR2  for  memory 
accesses  as  well  as  the  techniques  of  PPT  merging  and  compaction  introduced  in  the  previous 
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section.  As  with  Wcw ,  we  augment  the  values  written  by  CR/W  with  timestamps  to 
facilitate  PPT  merging  and  also  write  the  PPT  size  to  assist  with  compaction  {for  simplicity, 
these  two  additional  pieces  of  information  aTe  not  shown  in  Fig.  7).  The  resulting  algorithm 
will  be  referred  to  as  Wcr/w-  In  more  detail,  the  following  modifications  must  be  made  to 
the  four  phases  of  the  original  algorithm  W: 

Wl:  Processors  begin  this  phase  by  forming  single-processor  PPTs.  The  objective  is  for 
processors  to  write  to  each  internal  node  of  the  enumeration  tree  the  sum  of  the  values 
stored  at  its  two  children.  For  this  they  use  algorithm  CR/W  to  store  the  new  value, 
the  size  of  the  PPT,  the  index  of  the  level  that  completed  the  write,  and  a  timestamp. 
As  with  Wcw  the  timestamps  written  in  algorithm  W  are  not  required. 

Upon  completion  of  CR/W  all  PPTs  are  compacted.  In  order  to  merge  PPTs  the 
processors  of  each  need  to  read  the  data  stored  at  the  enumeration  tree  node  that  is 
the  sibling  of  the  node  they  just  updated.  For  this,  algorithm  CR1  is  used  and  after 
it.  finishes  PPTs  are  merged.  At  this  point  the  processors  of  the  merged  PPTs  know' 
the  value  they  need  to  write  at  the  next  level  of  the  enumeration  tree.  This  value  is 
just  the  sum  of  the  value  written  by  CR/W  and  the  value  read  by  CR1.  Therefore 
one  call  to  each  of  CR/W  and  CR1  is  needed  for  each  level  of  the  enumeration  tree. 

W2:  This  phase  involves  no  concurrent  writes  and  can  be  implemented  using  mostly  local 
computations.  Surviving  processors  traverse  top-down  the  progress  tree  to  allocate 
themselves  to  the  unvisited  leaves.  The  only  global  information  needed  at  each  level 
is  the  values  stored  at  the  two  children  of  the  current  node  of  the  progress  tree.  These 
values  are  read  using  two  calls  to  CR2,  one  for  each  child.  Using  this  information 
the  processors  of  a  PPT  compute  locally  whether  they  need  to  go  left  or  right.  No 
compaction  or  merging  is  done  in  this  phase.  As  PPTs  move  downwards  they  are  split 
as  described  in  phase  W2  of  algorithm  Wcw. 

W3:  Processors  organize  themselves  into  PPTs  based  on  the  information  they  gathered 
during  the  previous  phase  and  proceed  to  write  1  to  the  locations  that  correspond  to 
the  leaf  they  reached.  Each  of  these  writes  uses  algorithm  CR/W  and  is  followed  by 
compaction.  No  merging  is  involved. 

W4:  This  phase  initially  uses  the  PPTs  that  resulted  at  the  end  of  W3.  The  task  to  be 
performed  is  similar  to  that  of  Wl.  As  before  CR/W  is  used  for  writing  followed  by 
compaction  and  then  one  call  to  CR1,  after  which  PPTs  are  merged. 

Before  embarking  on  the  analysis  of  this  algorithm,  we  discuss  briefly  how  to  accomodate 
the  B  arrays  of  all  the  PPTs  in  0(P)  space.  We  use  an  array  B  of  length  P  and  implement 
the  B  array  of  each  PPT  as  a  contiguous  subarray  of  B.  Each  PPT  maintains  the  starting 
location  of  this  subarray,  whose  length  is  just  the  size  of  the  PPT. 

Initially  each  PPT  has  a  single  processor  and  processor  P,  is  allocated  location  B[i}. 
When  two  PPTs  are  merged  we  need  to  combine  their  B  subarrays.  If  the  left  PPT  uses  a 
subarray  of  size  sj  that  starts  at  and  the  corresponding  parameters  for  the  right  PPT 


are  sr  and  #[tr],  then  the  merged  PPT  will  use  a  subarray  of  size  S[  +  sT  starting  at  5[tjJ. 
The  only  information  that  needs  to  be  communicated  between  the  two  PPTs  is  that  the 
right  PPT  must  know  tt.  This  is  accomplished  by  having  the  processors  of  the  left  PPT 
write  tj  along  with  the  new  value  for  T ,  the  timestamp,  the  PPT  size  and  the  level  index. 
Note  that  the  subarrays  used  by  the  two  PPTs  that  are  merged  may  not  be  contiguous  in 
B  but  in  this  case  the  space  between  the  two  subarrays  is  unused  (i.e.,  there  may  be  a  gap 
between  them'  since  the  two  PPTs  are  siblings. 

When  PPTs  are  split  during  W2  their  B  subarrays  are  also  split.  If  a  PPT  of  size  s  that 
uses  a  subarray  starting  at  B[t]  is  split  into  a  left  PPT  of  size  s'  and  a  right  PPT  of  size 
s  -  then  the  subarray  for  the  left  PPT  will  start  at  Sit]  and  the  one  for  the  right  PPT 
will  start  at  B[t  +  s'].  Only  local  computations  are  involved  here. 

Analysis  of  algorithm  W^r/w1  Although  we  can  guarantee  a  write  concurrency  of  at 
most  |F|  for  algorithm  W^r/w,  the  read  concurrency  is  slightly  more.  This  is  because  PPT 
compaction  does  not  eliminate  all  the  failed  processors  from  a  PPT  but  only  those  above 
the  level  that  completed  the  write.  Failed  processors  below  this  level  remain  in  the  PPT 
and  although  they  do  not  cause  any  concurrent  writes  they  may  cause  concurrent  reads. 

In  order  to  account  for  the  concurrent  reads  incurred  we  associate  a  potential  with  each 
processor.  Our  intention  is  to  make  the  overall  potential  au  upper  bound  on  p.  Below  we 
explain  how  this  is  done. 

Let  denote  the  potential  of  the  the  ith  processor.  1  <i  <  P,  let  $ppr  -  e  ppt 
denote  the  potential  of  a  PPT,  atid  let  $  =  £i<i<p  be  the  overall  potential.  We  use 
to  denote  the  change  in  the  potential  of  the  tth  processor  during  an  operation  or  sequence 
of  operations.  Initially  -  0,  1  <  i  <  P. 

During  each  operation  we  increase  the  potential  of  a  failed  processor  by  an  amount  that 
is  an  upper  bound  on  the  number  of  concurrent  reads  that  can  potentially  be  incurred  during 
this  operation  due  to  the  failure  of  this  processor.  As  a  result  $  will  be  an  upper  bound 
on  the  total  number  of  concurrent  reads  incurred  during  an  exe-ution  of  the  algorithm,  i.e., 
/)  <  4*.  The  following  lemma  describes  the  required  increase  in  the  potential  of  a  failed 
processor  during  an  execution  of  each  of  the  algorithms  CR/W,  CRl  and  CR2. 

Lemma  3.3.  Let  1 1  be  the  PPT  level  of  a  processor  at  the  beginning  of  an  execution  of  any 
of  the  algorithms  CRl,  CR2,  or  CR/W  and  let  l2  be  the  PPT  level  of  the  same  processor 
at  the  end  of  this  execution.  If  this  processor  is  dead  at  the  end  of  this  execution,  then  its 
potential  should  be  increased  by  (i)  l2  -  h  during  CRl,  (ii)  2  during  CR2,  and  (iii)  l2  -  h 
if  it  is  at  or  below  the  level  that  completes  the  write  and  log  P  -  if  it  is  above  this  level 
during  CR/W. 

Proof.  We  need  to  show  that  the  proposed  increases  in  the  potential  of  a  failed  processor 
are  upper  bounds  on  the  number  of  concurrent  reads  that  can  be  incurred  due  to  its  failure, 
(i)  From  the  description  of  the  algorithm  it  is  clear  that  concurrent  reads  are  the  result 
of  processors  discovering  that  processors  at  higher  PPT  levels  have  failed  and  could  not 
broadcast  as  a  result.  In  this  case  the  failed  processors  move  to  lower  PPT  levels  as  the 
processors  that  detect  the  failuiss  take  their  position  in  the  PPT.  Since  a  failed  processor 
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moves  by  at  least  one  level  for  each  concurrent  read  it  causes,  the  number  of  concurrent 
reads  caused  by  a  failed  processor  that  moves  from  level  lj  to  level  fg  can  be  at  most  lj  ~  l\, 
(ii)  Since  in  CR2  a  processor  is  responsible  only  for  broadcasting  to  its  two  children,  a  failed 
processor  can  cause  at  most  two  concurrent  reads.  This  happens  if  both  of  its  children  are 
alive  and  other  processors  at  the  same  PPT  level  are  also  accessing  the  same  location  as  a 
result  of  their  parents  also  having  failed. 

(ili)  Let  L  be  the  index  of  the  PPT  level  that  completed  the  write.  If  a  processor  is  above 
L  then  no  processor  wiB  take  its  position  in  the  PPT.  Since  the  PPT  can  have  up  to  log  P 
levels  this  processor  can  cause  a  concurrent  read  at  each  of  the  levels  below  its  own  resulting 
in  log  P  —  It  concurrent  reads.  If  the  processor  is  at  or  below'  L  then  the  situation  is  similar 
to  that  of  case  (i)  and  by  the  same  argument  the  processor  will  incur  h  ~  h  concurrent 
reads.  Q 

Because  of  PPT  compaction  it  is  possible  for  some  failed  processor  to  actually  move  to 
a  higher  PPT  level  after  an  execution  of  CR/W  (note  that  compactions  occur  only  after 
CR/W).  To  account  for  this  we  charge  each  processor  that  is  eliminated  by  compaction  for 
log  P  concurrent  reads.  As  the  following  lemma  shows,  this  one  time  charge  is  sufficient  to 
account  for  all  tve  concurrent  reads  that  can  be  caused  by  failed  processors  that  move  to 
higher  levels  following  a  compaction. 

Lemma  3.4.  If  each  processor  that  is  eliminated  by  compacting  a  PPT  is  charged  log  P 
and  each  processor  that  moves  to  a  higher  level  as  a  result  of  the  compaction  is  credited  for 
the  number  of  levels  it  moves,  then  the  potential  $ppt  °f  the  PPT  does  not  increase  and 
neither  does  $. 

Proof.  Let  L  be  the  PPT  level  that  completes  the  write.  As  a  result  of  compaction  2L  -  1 
processors  will  be  eliminated.  Increasing  the  potential  of  each  such  processor  by  log  P 
increases  the  potential  of  the  PPT  by  (2L  —  l)logP. 

The  absence  of  these  2l  -  1  processors  from  the  compacted  PPT  means  that  some 
processors  at  levels  >  L  move  up.  Since  2^  —  1  processors  are  eliminated,  the  first  2^  —  1 
processors  of  each  level  >  L  will  move  up  while  all  other  processors  will  move  left  within 
their  levels  but  will  not  change  level.  We  decrease  the  potential  of  each  processor  that 
moves  up  by  the  number  of  levels  it  moves. 

The  2l  -  1  processors  of  level  L  that  will  move  may  move  by  as  many  as  L  levels,  so  the 
decrease  in  their  potential  is  at  most  L  (this  is  a  pessimistic  estimate).  The  processors  of 
the  other  levels  that  will  move  may  move  by  just  one  level,  so  their  potential  is  decreased 
by  1. 

The  total  decrease  in  potential  for  all  processors  that  move  to  higher  levels  is  at  most 

lo  %p 

(2l-1)L+  £  (2i-l)  =  ^2i- DlogP, 

i=L+l 

equal  to  the  increase  in  potential.  Thus  potential  redistribution  does  not  increase  Qppj 
(since  we  are  taking  a  pessimistic  approach,  ippr  may  actually  decrease  du<;  to  this  redis¬ 
tribution).  As  a  consequence  $  does  not  increase  either.  □ 
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The  remaining  operation  involved  in  PPT  merging,  can  never  cause  a  processor 

to  move  to  a  higher  level.  As  a  result  we  neither  increase  nor  decrease  processor  potentials 
during  this  operation,  so  we  ignore  merging  in  the  rest  of  our  analysis. 

During  phases  W1  and  W4  the  main  operation  at  each  level  of  the  enumeration  or 
progress  tree  entails  one  call  to  each  of  CR/W,  PPT  compaction,  and  CR1  in  this  order. 
Phase  W3  is  similar  but  without  CR1  (phase  W2  is  quite  different).  The  following  lemma 
provides  a  bound  on  the  change  of  potential  during  a  sequence  of  such  operations. 

Lemma  3.5.  Let  s  be  the  length  of  a  sequence  of  steps  each  of  which  consists  of  a  call  to 
CR/W,  PPT  compaction ,  and  CRl  in  this  order  and  let  Pi  be  a  processor  that  is  dead  at 
the  end  of  this  sequence  but  is  not  eliminated  by  compaction  during  the  sequence.  If  Pi  is  at 
PPT  level  l0  at  the  beginning  of  the  sequence  and  at  level  lj  after  the  jth  step  of  the  sequence 
(l  <  j  <  s):  then  A$,  —  I3-l0  over  the  entire  sequence. 

Proof.  Consider  the  jth  step  of  the  sequence.  Pi  is  at  level  lj-\  at  the  beginning  of  this 
step  and  at  level  lj  at  the  end  of  it.  If  Pi  is  at  level  lj  >  lj- 1  after  the  call  to  CR/W  and 
at  level  /"  <  /'  after  compaction,  then  increases  by  lj  —  lj- 1  +  lj  —  lj  (Lemma  3.3)  and 
decreases  by  /'  -  I'j  (Lemma  3.4).  Thus  A4><  =  lj  -  lj- j  for  the  jth  step  and  for  the  entire 
sequence  we  obtain 

This  shows  that  the  change  in  potential  depends  only  on  the  starting  and  ending  PPT  levels 
and  not  on  the  length  of  the  sequence.  ° 

As  in  the  case  of  algorithm  W,  we  may  cluster  several  elements  of  the  input  array 
and  assign  them  to  a  single  leaf  of  the  progress  tree.  We  use  the  notation  WCR/WM  to 
indicate  that  input  elements  are  organized  into  clusters  of  size  s.  For  the  case  s  =  1  we 
use  the  shorthand  WCR/W  for  WCR/w[l).  The  following  theorem  provides  bounds  on  the 
performance  of  this  special  case: 

Theorem  3.8.  Algorithm  WCR/w  w  a  robust  algorithm  for  the  Write-All  problem  with 
S  -  0  ( .Y  log  Ar  log  P  +  P  log  P  log2  $ /  log  log  A7 ) ,  read  concurrency  p  <  TjFjlogjV,  and 
write  concurrency  u>  <  jFj,  where  1  <  P  <  N . 

Proof.  The  bounds  on  S  and  u  are  obtained  as  in  Lemma  3.2.  To  show  the  bound  on  p 
we  consider  the  change  A4\  of  the  potential  of  a  failed  processor  throughout  a  block  step 
(an  iteration  of  the  while  loop).  It  suffices  to  focus  on  a  single  block  step  since  PPTs  are 
rebuilt  at  the  beginning  of  phase  Wl  and  dead  processors  are  eliminated.  As  a  result,  a 
failed  processor  can  cause  concurrent  reads  for  at  most  one  block  step. 

Suppose  that  a  processor  is  eliminated  by  compaction  when  it  is  at  level  l  of  a  PPT. 
It  is  clear  that  the  processor  causes  the  maximum  number  of  concurrent  reads  when  this 
elimination  takes  place  in  phase  W4  while  the  processor  has  failed  since  phase  Wl. 

When  it  is  eliminated  its  potential  will  increase  by  2  log  P  -  f,  where  log  P  is  due  to  the 
potential  redistribution  as  described  in  Lemma  3.4,  and  log  P  -  l  is  due  to  the  concurrent 
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reads  it  may  cause  in  the  last  execution  of  CR/W  right  before  the  compaction  that  eliminates 
it  (Lemma  3.3,  (iii)).  >. 

The  concurrent  reads  it  causes  until  its  elimination  are  accounted  for  as  follows.  In 
phase  Wl  the  sequence  CR/W,  PPT  compaction,  CR1,  is  executed  once  for  each  level  of 
the  enumeration  tree.  If  the  processor  is  at  PPT  level  ls  at  the  beginning  of  the  phase  and 
at  level  lj  at  the  end,  then  by  Lemma  3.5  the  change  in  potential  is  If  —  It  <  logP  since 
the  height  of  the  PPT  is  at  most  log  P. 

Since  the  PPTs  that  exist  at  the  end  of  W3  are  the  same  as  those  used  at  the  beginning  of 
W4,  as  far  as  accounting  goes  these  two  phases  can  be  considered  as  being  one.  Consider  the 
sequence  of  calls  to  CR/W,  compaction,  and  CR1  upto  the  call  to  CR/W  that  immediately 
precedes  the  compaction  that  eliminates  this  processor.  The  processor  is  at  level  l  at  the 
end  of  this  sequence.  If  it  is  at  level  l'  at  the  beginning  of  W3,  then  by  Lemma  3.5  the 
change  in  potential  is  /  —  /'</. 

Finally,  the  change  in  potential  for  phase  W2  is  2  for  each  execution  of  CR2  and  there 
are  two  such  executions  for  each  level  of  the  progress  tree.  Thus  the  change  in  potential  is  4 
for  each  level  of  the  progress  tree  and  the  total  change  in  potential  for  phase  W2  is  4  log  N. 

C  ombining  the  above  we  have  that  A$t  for  an  entire  block  step  is  bounded  by 

log  P  +  4  log  Ar  +  l  +  2  log  P  -  l  <7  log  N. 


Thus  the  amortized  number  of  concurrent  reads  charged  to  a  failed  processor  is  at  most 
7  log  A' .  Since  the  potential  of  alive  processors  is  non-positive  we  obtain  the  following  bound 
for  p 

P<*=  £  *i<  £  $,<7|p|iogjv. 

1<«<P  Pi  failed 


□ 


8.3.  An  Optimal  Algorithm 

As  discussed  in  Section  2.3.1,  algorithm  W  becomes  optimal  if  input  elements  are  clus¬ 
tered  into  groups  of  size  log  Ar,  each  group  associated  with  one  leaf  of  the  progress  tree,  and 
a  number  of  processors  smaller  than  N  is  deployed.  In  a  similar  vein,  algorithm  WCR,w 
performs  best  if  clusters  of  size  larger  than  one  are  used  and  fewer  than  N  processors  are 
employed. 

Since  the  enumeration  and  progress  trees  traversed  have  height  0(log  N)  and  algorithm 
CR/W  requires  O(logP)  time,  phases  Wl,  W2,  and  W4  require  O(loglVlogP)  time.  Thus, 
we  cluster  the  input  elements  into  groups  of  size  log  N  log  P  and  have  a  progress  tree  with 
Nf  leg  N  log  P  leaves.  The  resulting  algorithm  WCR/w[log  NlogP]  turns  out  to  be  non- 
optimal,  so  a  further  modification  is  needed. 

To  obtain  optimality  we  make  use  of  the  perfect  allocation  property  of  algorithm  W 
(see  [18]).  This  property  guarantees  that  available  processors  are  allocated  evenly  to  the 
un visited  leaves  in  phase  W2.  Let  U,  and  Pi  be  the  numbers  of  unvisited  leaves  and  available 
processors,  respectively,  at  the  beginning  of  the  ith  iteration  of  the  while  loop  of  algorithm 
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Wcr/vv-  The  balanced  allocation  property  implies  that  if  Ik  >  Pi,  then  at  most  one 
processor  will  be  allocated  to  each  unvisited  leaf. 

This  le'-ds  to  a  modification  of  algorithm  Wcr/wP°S  N  log  Pj.  The  difference  in  tins' 
algorithm  is  that  processors  examine  the  overestimate  of  remaining  processors  P,  and  the 
overestimate  of  the  number  of  unvisited  leaves  Vi  at  the  beginning  of  each  iteration.  If 
Vi  >  P{  we  are  guaranteed  that  there  wil  be  at  most  one  processor  per  leaf  at  the  beginning 
of  phase  W3  so  that  there  is  no  need  to  use  CR/W  and  compaction.  Instead  processors  go 
sequentially  through  the  log  AT  log  P  dements  of  the  leaf  they  reached  spending  0(1)  time 
foT  each  element.  If,  on  the  other  hand,  Ik  <  Pi  several  processors  may  be  allocated  to  the 
same  leaf  and  algoritmn  CR/W  needs  to  be  used.  The  latter  situation  occurs  infrequently 
enough  that  the  algorithm  becomes  optimal  for  a  certain  range  of  processors,  as  we  show 
below. 

We  refer  to  algorithm  Wcr/w  [log  AT  log  P]  with  the  above  modification  as  algorithm 
Wcr/vv*  The  following  theorem  provides  bounds  on  its  performance. 


Theorem  3.7.  Algorithm  W^w  is  a  robust  algorithm  for  the  Write-All  problem  with 

S  =  0{N  +  P^-^i|^--),  write  concurrency  <  |F|,  and  read  concurrency  p  <  7  jFjlogA, 
where  i  <  P  <  /?. 


Proof.  Let  P,  be  the  (overestimate  of  the)  number  of  live  processors  and  L\  be  the  (over¬ 
estimate  of  the)  number  of  unvisited  leaves  at  the  begining  of  the  tth  iteration.  We  divide 
the  iterations  into  two  cases: 


Cast  1:  The  number  of  unvisited  leaves  is  V%  >  Pi.  In  this  case  phase  W3  requires 
O(log.YlogP)  time,  the  same  as  the  time  to  traverse  the  enumeration  and  progress  trees. 
Bv  Lemma  2.1  the  number  of  block  steps  for  all  such  iterations  is  at  most  Uq,  w'here  f’o 
is  the  initial  number  of  leaves.  In  this  case  Uq  —  N j  log  N  log  P  and  the  total  work  for  all 


such  iterations  is 


Si 


N 

log  N  log  P 


0(logArloc  P)  -  0(Ar). 


Case  2;  The  number  of  unvisited  leaves  is  L,  <  P,-.  In  this  case  all  the  processors  could 
be  allocated  to  the  same  leaf.  Thus,  in  the  worst  case,  log  P  time  must  be  spent  at  each 
element  of  the  leaf  and  since  there  are  log  AHog  P  dements  per  leaf  the  worst  case  time 
to  update  a  leaf  is  (9 (log  N  iog2  P).  By  Lemma  2.1  the  number  of  block  steps  for  all  such 
iterations  is  at  most  0{P^fJ^\})-  Hence  the  total  work  of  all  these  iterations  is 


S2 


0{P 

0{P 


v)  (0(log  N  log  P)  +  0( log  N  log2  P)) 
log  log  A 

log2  N  log2  P . 

log  log  A 


In  addition,  we  must  account  for  the  work  performed  by  processors  that  fail  in  the 
middle  of  an  iteration.  There  are  at  most  0(P)  such  block  steps  each  costing  at  most 
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0{log  JVlog2 P)  so  that  the  total  work  In  this  case  is  So  =  0(P log  N  log2  P).  Combining 
the  three  cases,  we  have 


5  -  So  +  5,  +  fi,  =  0(JV  t  P- £!£&■-). 


The  algorithm  is  optimal  if  the  second  term  is  0{N). 

The  bound  on  us  is  obtained  as  before  by  noticing  that  whenever  a  PPT  has  more  than 
one  processor,  the  tree  is  compacted  after  the  write  to  eliminate  any  processors  that  have 
caused  concurrent  writes.  The  bound  on  p  is  obtained  as  in  Theorem  3.6.  □ 

3-4-  Handling  Unitialized  Memory 

The  algorithms  considered  so  far  require  that  memory  be  initially  clear.  Algorithm  Z 
of  [32]  extends  algorithm  W  to  handle  unitialized  memory  (see  Section  2.3.3).  It  is  possible 
to  incorporate  CR/W  into  Z  to  obtain  a  Write- All  algorithm  with  limited  concurrency  that 
dispenses  with  the  requirement  that  memory  be  dear. 

A  problem  that  arises  if  we  try  to  combine  the  two  algorithms  directly  is  that  we  can 
no  longer  guarantee  good  read  and  write  concurrency.  This  stems  from  the  fact  that  to 
achieve  these  bounds  the  PPTs  that  are  built  at  the  beginning  of  phase  Wl  must  contain 
no  processors  that  are  known  to  have  failed.  This  is  easy  to  do  in  the  algorithms  considered 
so  far  since  each  PPT  starts  out  with  at  most  one  processor.  In  algorithm  Z,  however,  there 
can  he  several  processors  that  are  assigned  to  the  same  leaf  of  the  enumeration  tree  and 
thus  belong  to  the  same  PPT  at  the  begining  of  Wl;  this  happens  whenever  P  >  JVj.  The 
assignment  of  processors  to  such  leaves  is  made  based  on  the  processor  PIDs  and  since  this 
does  not  take  into  account  failed  processors,  the  resulting  PPTs  may  have  dead  processors 
that  have  already  been  accounted  for  in  u. 

To  work  around  this  we  will  use  an  enumeration  tree  with  P  leaves  for  all  iterations  of  Z. 
so  that  each  PPT  will  have  at  most  one  processor  at  the  beginning  of  Wl,  To  initialize  tins 
tree  we  note  that  phase  Wl  is  a  static  phase  in  the  sense  that  a  given  processor  traverses 
the  same  path  through  the  enumeration  tree  every  time  Wl  is  executed.  In  particular, 
since  processors  can  not  restart  once  they  have  failed,  if  a  node  of  the  enumeration  tree 
is  accessed  at  all,  it  must  be  accessed  the  first  time  Wl  is  executed.  The  consequence  of 
this  is  that  we  do  not  need  to  clear  the  entire  enumeration  tree  but  only  that  portion  that 
can  be  accessed  during  the  first  execution  of  Wl.  To  achieve  this,  processors  make  an 
initial  bottom-up  traversal  of  the  enumeration  tree  similar  to  the  traversal  of  Wl  (i,e.,each 
processor  traverses  the  same  path  it  would  traverse  in  Wl)  and  clear  the  locations  that  they 
would  access  in  Wl.  This  process  is  carried  out  as  follows. 

Processors  start  at  the  leaves  of  the  enumeration  tree,  with  the  ith  processor  at  the  tth 
leaf,  so  that  each  processor  forms  a  PPT  by  itself.  PPTs  then  traverse  the  tree  bottom-up 
taking  the  Same  paths  as  in  phase  Wl.  The  operation  performed  by  the  PPTs  at  each 
node  they  reach  consists  of  clearing  the  node  they  are  at  using  CR/W,  PPT  compaction, 
clearing  the  sibling  node  using  CR/W,  PPT  compaction,  reading  the  contents  of  the  node 
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they  are  at  using  CR1,  and  finally  PPT  merging.  The  reason  for  cleaning  the  sibling  node 
is  that  during  phase  W 1  processors  need  to  access  the  node  they  are  at  and  its  sibling  and 
we  need  to  make  certain  that  this  sibling  will  not  contain  garbage  data-  The  reason  for  the 
remaining  operations  is  to  assist  with  merging.  Essentially,  this  initialization  phase  is  like 
Wl  but  processors  clear  the  sibling  nodes  instead  of  summing.  The  cost  of  this  initialization 
is  O(log  N  log  P)  since  the  enumeration  tree  is  of  ht'ght  0(logN)  and  the  time  spent  at 
each  level  is  G(log  P). 

The  algorithm  that  results  from  Z  by  using  a  P-leaf  enumeration  tree  initialized  as  de¬ 
scribed  above  and  utilizing  algorithm  W^,w  to  dear  progressively  larger  areas  of  memory 
is  called  Zcr/w  and  has  the  performance  bounds  given  in  the  following  theorem: 

Theorem  3,8.  Algorithm  Zcr/w  n?sth  contaminated  memory  oj  size  0(P )  is  a  robust 
Write-All  algorithm  with  S  =  0(N  +  Plog2  Plog3  JV/(log]og  AT)2),  read  concurrency  p  < 
7  (Pj  log  N ,  and  write  concurrency  u  <  |Fj,  where  1  <  P  <  A7. 

Proof.  Since  W^,w  uses  dusters  of  si2e  log N log P  we  choose  Gi  —  log N log P, i  >  0 
as  the  parameters  for  Z  (see  Fig.  4).  Thus  initially  processors  dear  a  tnemory  area  of  size 
Go  =  log  iV  log  P  using  algorithm  CR/W;  this  costs  Plog  N  log2  P.  Following  this  there  are 
log  A7{log(log  Ar  log  P))  -  1  -  0(log  N l  log  log  N )  applications  of  algorithm  W^/w  where 
the  ith  application  dears  a  memory  area  of  size  A”,  =  (log  N  log  P},+1 .  This  analysis  is 
very  similar  to  the  analysis  of  [32]  and  the  work  bound  follows  by  the  same  method;  the 
only  difference  is  the  different  value  for  the  s  and  that  the  work  of  each  iteration  is  given 
by  Theorem  3.7.  The  bounds  on  p  and  u  are  obtained  as  in  the  previous  theorems  in  this 
section.  O 


3.5.  A  Lower  Bound 

Having  shown  a  robust  Write- All  algorithm  with  w  <  |Fj  it  is  interesting  to  consider 
whether  there  is  a  robust  algorithm  with  even  smaller  w.  Here  we  show  that  this  is  not 
the  case  for  a  large  class  of  functions  of  jFj.  Specifically,  we  show  that  there  is  no  robust 
algorithm  for  the  Write- .411  problem  with  concurrency  u>  <  JFj1  for  0  <  e  <  1. 

For  this  we  consider  a  simpler  Write-One  problem  [18]  which  involves  P  processors  and 
a  single  memory  location.  The  objective  is  to  write  1  into  the  memory  location  subject 
to  faults  determined  by  an  on-line  adversary.  The  adversary  will  try  to  maximize  the 
work  while  the  algorithm  will  try  to  minimize  it  by  appropriately  scheduling  the  available 
processors  subject  to  the  constraint  that  only  up  to  \F ie  concurrent  writes  are  permitted. 

We  allow  algorithms  that  can  sense  when  the  location  has  been  written  by  employing 
"oracles”  thus  having  no  need  to  certify  the  write.  Algorithms  without  such  power  can  not 
perform  better  than  oracle  algorithms  and  hence  the  lower  bound  applies  to  them  as  well. 
This  is  because  given  a  non-oracle  algorithm  A  there  is  an  orade  algorithm  B  that  mimics 
A  up  to  the  point  where  A  writes  to  memory.  At  this  point  B  stops  while  A  has  to  continue 
in  order  to  certify  the  write  thus  incurring  additional  work. 
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Since  not  all  processors  can  write  concurrently  the  algorithm  will  need  to  form  “waves” 
of  processors.  The  adversary  will  fail  each  wave  (but  the  last)  right  before  the  processors 
in  the  wave  write  to  the  memory.  This  way  the  adversary  prolongs  the  conclusion  of  the 
process  until  the  very  last  step  while  charging  each  processor  as  much  as  possible.  As 
the  following  lemma  shows,  the  algorithm's  best  answer  to  this  strategy  is  to  use  a  greedy 
approach,  that  is  assign  to  each  wave  as  many  processors  as  possible  within  the  constraints. 

Lemma  3.9.  The  work  of  Write-One  algorithms  is  minimised  when  processors  are  allocated 
greedily. 

Proof.  Let  W\ ,  W2, ....  Wk  be  the  processor  waves  and  let  5  be  the  total  work  of  the 
algorithm.  Assuming  that  the  original  schedule  is  not  greedy,  let  >Vj,  1  <  j  <  k  be 
the  first  wave  that  contains  fewer  processors  than  are  allowed  by  the  jFj*  constraint.  By 
assigning  a  processor  from  Wg  to  W,  for  some  j  <  j1  <  k  we  obtain  a  schedule  with  work 
S'  »  S  -  —  j)  <  S  and  hence  the  original  non -greedy  schedule  is  not  optimal.  □ 

We  can  now  show  the  following: 

Lemma  3.10.  Then  is  no  robust  algorithm  for  the  Write-One  problem  with  write  concur¬ 
rency  u}  <  |F|e.  for  9  <  e  <  1. 

Proof.  Since  jFj  <  F,  where  P  is  the  number  of  initial  processors,  it  suffices  to  show 
that  the  bound  holds  for  the  broader  class  of  Write-One  algorithms  that  allow  upto  P* 
concurrent  writes  per  step.  By  Lemma  3.9  Pt  processors  should  be  allocated  to  each  wave 
and  hence  there  will  be  P/P*  =s  F1"*  steps  (for  simplicity  we  assume  that  F  is  a  multiple 
of  F*  j.  Then  the  work  of  the  algorithm  is  )T£!j  '{P  ~  iP*)  -f-  P*  >  ^  =s  ft(F2”*).  O 

We  can  now  show  that: 

Theorem  3.11.  There  is  no  robust  algorithm  for  the  Write-All  problem  with  write  concur¬ 
rency  u!  <  jFj*.  for  0  <  £  <  1. 

Proof.  Cousider  a  Write-All  algorithm  for  instances  of  si2e  N  with  P  processors.  The 
adversary  picks  one  of  the  array  locations  and  fails  only  processors  that  attempt  to  write  to 
it  according  to  the  strategy  outlined  earlier  in  this  section.  According  to  Lemma  3.10  the 
work  of  the  algorithm  will  be  fl(P2~£).  □ 

of. 6'.  PHAM  Simulations  Subject  to  Dynamic  Faults 

As  discussed  in  Section  2.3.2,  solutions  to  the  Write- All  problem  can  serve  as  building 
blocks  for  efficent  simulations  of  parallel  algorithms  on  fail-stop  PRAMs.  The  algorithms 
presented  above  likewise  can  be  employed  in  fault-tolerant  simulations. 

By  using  algorithm  W^w  we  obtain  efficient  PRAM  simulations  on  fail-stop  PRAMs. 
In  particular,  we  obtain  the  following: 
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Theorem  3.12.  Any  N  processor,  r  parallel  time  EREW'  PRAM  algorithm  can  hi.  simu¬ 
lated  on  a  fail-stop  P -processor  CRCW  PRAM  with  using  wopk  0(r  •  (N  +  |V)) 

so  that  the  write  concurrency  of  the  simulation  is  u  <  jFj  and  the  read  concurrency  is 
p  <  7jF|  log#,  for  any  failure  pattern  F  and  1  <  P  <  N- 

The  work  bound  of  this  theorem  also  holds  for  simulations  of  r.on-EREW  PRAM  algo¬ 
rithms  but  the  concurrency  bounds  depend  on  the  concurrency  of  the  simulated  algorithm. 
If  -.*0  and  pa  are  the  write  and  read  concurrencies  of  the  simulated  algorithm,  respectively, 
then  for  the  simulation  we  have  u  <  jFj  +  ut0  and  p  <  7  jFj  log  N  -f  p0  O(logJV).  Alterna¬ 
tively,  we  can  maintain  the  same  concurrency  bounds  at  the  expense  of  increasing  the  work 
by  a  logarithmic  factor  by  first  converting  the  original  algorithm  into  an  equivalent  EREW 
algorithm  [19]. 

Similar  results  can  be  derived  for  contaminated  initial  memory  using  algorithm  Z  and 
Theorem  3.4  to  initialize  an  appropriate  amount  of  auxiliary  shared  memory  required  by  the 
simulation.  There  is  an  additive  expense  in  woik  of  0(N  +  Flog2  Flog3  A7(loglog  jV'}2). 

Finally,  if  the  faulty  processors  are  restarted  in  the  simulations  so  that  they  are  re- 
synchronized  at  the  beginning  of  the  individual  simulated  PRAM  steps  then  the  simulations 
results  apply  as  well. 

4.  Algorithm  W  with  Bounded  Maximum  Access 

For  the  algorithms  developed  in  the  previous  section,  the  read  and  write  concurrencies 
are  bounded,  but  during  some  time  steps  the  read  and  write  concurrency  might  still  be  as 
high  as  ©(F)  for  large  jFj.  In  this  section,  we  are  going  to  employ  a  pipelining  technique 
to  limit  the  maximum  per  step  concurrency  of  algorithm  W. 

As  a  first  approach  to  limiting  concurrency  we  note  that  phases  Wl,  W2,  and  W4  of 
algorithm  W  involve  traversal  of  trees  of  logarithmic  depth  with  only  one  level  of  the  trees 
being  used  at  any  given  time.  A  solution  that  naturally  suggests  itself  is  to  divide  the 
processors  into  waves  and  have  them  move  through  the  trees  separately  by  introducing  a 
time  delay  between  the  waves,  if  we  choose  the  number  of  waves  carefully  it  is  possible  to 
reduce  the  number  of  concurrent  accesses  without  degrading  the  asymptotic  work  of  the 
algorithm.  In  particular,  since  the  trees  are  of  height  log  N  we  can  divide  the  processors 
into  log.V  waves  each  of  F/  log  N  processors.  During  each  of  the  phases  Wl,  W2,  and  W4 
we  send  the  processors  through  the  appropriate  trees  one  wave  at  a  time  thus  reducing  the 
number  of  concurrent  accesses  to  at  most  F/log  N  for  any  memory  location.  Since  the 
values  computed  at  the  nodes  of  the  heaps  during  each  phase  will  be  correct  only  after 
the  last  wave  of  processors  finishes,  waves  that  finish  earlier  must  wait  for  the  last  wave  to 
arrive  before  beginning  the  next  phase.  This  introduces  a  O(iog  A’)  additive  overhead  per 
phase  doubling  the  running  time  of  each  phase  but  does  not  affect  the  asymptotic  efficiency 
of  the  algorithm. 

For  phases  W2  and  W4,  it  is  not  significant  how  we  divide  the  processors  into  waves. 
However,  phase  Wl  requires  that  processors  be  assigned  to  waves  in  a  left  to  right  fashion, 
i.e..  the  leftmost  log  N  processors  form  the  first  wave,  the  next  log  A'  processors  are  in  the 
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second  wave  and  so  on.  This  is  because  phase  W1  computes  prefix  sums  and  a  processor 
must  not  reach  a  node  of  the  processor  enumeration  tree  before  any  processor  with  smaller 
PID. 

This  pipelining  idea  can  be  generalized  by  allowing  a  greater  number  of  processor  waves. 
In  particular,  for  any  k  >  1  we  can  partition  the  processors  into  log*  N  waves  each  of 
Pj  log*''  N  processors.  Since  there  can  only  be  log  N  active  waves  at  any  time,  each  phase 
will  be  slowed  down  by  a  factor  of  log*"1  N .  This  yields  the  following: 

Theorem  4.1.  Algorithm  ITflog  N]  with  waves  is  a  robust  algorithm  for  the  Write-All  prob¬ 
lem  with  S  =  0(Nlo$k~*  N  +  P  log*+1  N l  log  log  N)  for  any  k  >  1  and  with  at  most 
Pj  log*  N  concurrent  accesses  at  any  time  for  any  memory  location. 

Remark  4.  For  A-  =  1  there  is  no  asymptotic  degradation  in  the  efficiency  of  the  algorithm. 

Even  though  the  pipelining  of  processor  waves  reduces  the  number  of  concurrent  accesses 
for  each  memory  location,  the  algorithm  still  makes  a  lot  of  concurrent  accesses  overall.  For 
example,  for  k  =  1  and  P  =  N  the  algorithm  makes  —  i))  =  ft(jV)  concurrent 

writes  when  no  processors  fail  since  each  wave  makes  —  1  concurrent  writes  at  the  root 
and  there  are  log  A7  such  waves. 

The  use  of  processor  waves  can  be  combined  with  the  PPTs  of  Section  3.2  to  yield  an 
algorithm  that  uses  waves  of  priority  trees  to  control  both  the  per  step  and  the  overall 
concurrency: 

Theorem  4.2.  Algorithm  W^w  with  pipelined  PPTs  is  a  robust  Write-All  algorithm 
with  S  -  0(A'log*_1  N  +  P  log2  Flog*4"1  A7/ log  log  A7)  for  any  k  >  1,  read  concurrency 
p  <  7  \F\  log  A’ .  write  concurrency  w  <  ]F|  and  with  at  most  Pj  log*  N  concurrent  accesses 
at  any  time  for  any  memory  location. 

5.  Write- All  with  Static  Faults 

The  failure  model  we  have  been  considering  so  far  allows  arbitrary  dynamic  faults.  As 
we  have  seen  this  model  is  too  strong  to  allow  efficient  solutions  without  concurrent  memory 
accesses.  In  this  section  we  consider  the  alternative,  weaker,  model  of  static  (initial)  faults 
in  which  failures  can  only  occur  prior  to  the  start  of  an  algorithm.  We  will  show  that  simple 
ami  efficient  solutions  are  possible  in  this  model  without  requiring  concurrent  accesses  and 
without  any  need  to  initialize  ..he  shared  memory. 

We  assume  that  the  size  of  the  Write- All  instances  is  N  and  that  we  have  P  processors, 
P'  <  P  of  which  are  alive  at  the  beginning  of  the  algorithm.  We  do  not  know  P‘  a  priori 
or  which  of  the  processors  are  faulty. 

Our  EREW  algorithm  E  consists  of  two  phases  El  and  E2.  In  phase  El,  processors  enu¬ 
merate  themselves  and  compute  the  total  number  of  live  processors.  With  this  information 
in  phase  £2,  the  processors  partition  the  input  array  so  that  each  processor  is  responsible 
for  setting  to  j  all  the  entries  in  its  partition  (Fig.  9). 
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The  enumeration  phase  El  is  similar  to  ^hase  W1  of  algorithm  W  that  is  based  on  the 
parallel  prefix  computation.  However  W1  --unot  be  used  because  of  its  concurrent  accesses 
and  the  standard  parallel  prefix  algorithm  is  not  adequate  since  it  is  oblivious  and  requires 
that  ah  processors  be  available.  We  present  a  simple  non-oblivious  EREW  enumeration 
algorithm  that  does  not  assume  a  known  number  of  processors  and  that  produces  correct 
results  for  any  pattern  of  static  failures. 

A  detailed  description  of  phase  El  is  given  in  Fig.  10.  It  uses  a  full  binary  tree  of  height 
log  P  for  enumeration,  denoted  T  in  the  figure.  The  processors  begin  with  initial  rank  1  at 
the  leaves  and  move  up  toward  the  root.  The  algorithm  ensures  that  at  each  interior  node 
of  the  tree,  at  most  one  processor  moves  up.  If  two  processors  reach  sibling  nodes  within 
the  tree,  one  of  them  has  to  stop  at  the  current  level  to  avoid  concurrent  accesses  at  the 
parent.  By  convention,  the  processor  that  is  allowed  to  continue  is  the  one  coming  from  the 
right. 

Each  processor  Pi  that  continues  needs  to  be  able  to  communicate  with  the  processors 
that  have  stopped  at  lower  levels  of  the  subtree  rooted  at  the  current  location  of  P,  so  that 
it  can  broadcast  rank  updates  to  them.  To  this  end,  the  stopped  processors  in  P;’s  subtree 
form  a  tree  whose  root  is  the  processor  most  recently  stopped  because  it  encountered  P,. 
Note  that  each  processor  in  this  tree  stopped  because  it  encountered  either  P,  or  some  other 
processor  in  the  tree.  P{  can  broadcast  a  value  to  the  processors  in  this  tree  by  passing  the 
value  to  the  root  of  the  tree  and  then  having  each  processor  pass  the  value  to  its  children. 

To  do  this,  the  algoritum  of  Fig.  10  assumes  that  each  node  of  T  has  two  fields:  val,  which 
is  used  for  broadcasting,  and  right-child,  which  points  to  the  root  of  the  right  subtree  of 
stopped  processors.  We  also  use  left-child  to  point,,  to  the  left  subtree.  Note  that  left-child 
is  a  private  variable,  whereas  right-child  is  a  node  field;  this  is  because  a  processor  P, 
determines  its  left  subtree  by  itself  while  its  right  subtree  is  determined  by  the  processor 
that  stops  P,. 

As  a  processor  moves  up,  it  checks  whether  there  is  any  processor  at  the  sibling  node  by 
using  a  “probing*’  scheme.  The  processor  writes  0  to  its  sibling  node,  then  writes  its  current 
rank  at  its  own  node  and  then  reads  the  sibling  node  (lines  7-9).  If  it  is  0,  then  there  is  no 
processor  at  the  sibling  node,  otherwise  the  value  read  is  the  number  of  processors  in  the 
subtree  rooted  at  the  sibling.  As  in  the  standard  parallel  prefix  the  processor  that  comes 
from  the  right  adds  this  value  to  its  rank  (line  15).  In  our  case,  the  right  processor  also 
initiates  the  broadcasting  of  this  value  to  the  processors  in  its  tree  of  stopped  processors 
by  writing  the  value  to  the  root  of  this  tree  (line  16).  The  trees  of  the  two  processors  that 
meet  are  merged  by  creating  a  new  tree  with  its  root  being  the  processor  that  came  from 


01  forall  processors  PID=1..P  par  begin 

02  Phase  El:  Use  non-oblivious  parallel  prefix  to  compute  rankpw  and  P’ 

03  Phase  E2:  Set  —  1)  •  fir . . .  (ra nkpiv  ■  -pr)  —  l]  to  1 

04  par  end 


Fig.  9.  A  high  level  view  of  algorithm  E. 


pi  currentjrank  ;s=  1  —  each  processor  start;  with  initial  rank  l 
QZieftjchiid  :=  nil  ——  tut  left  ekJMt&i  pi 
03  stopped  :*s  false  — initially  ell  processors  can  shove  tip 
04  pos  :=  P  +  PID  -  2  —  initial  position  *n  the  tree 
05  for  t  :*s  l..iog  P  do 

06  if  Hopped  *  false  ther  processor  can  stilt  mow 

0?  Tfsihling(pos)].val  :*  0  — — |»rabe  sibling;  siblingQ  returns  the  location  of  the  sibling  node 
08  T[pos].vai  :=  current-rimi; 

09  ranktu,  :sc  T[aiWt»ff(pu4)}.t«ti 

10  Jf  ranb^t,  2=  0  then  —— noone  there 

11  jjoj  :s=  \pos/2\  — move  to  parent  node 

12  else  if  current  node  is  a  left  child  then 

13  stopped  :*  true;  rfpejj.val  :=s  0  < — let  the  sibling  processor  continue 

14  else - current  node  is  a  right  child;  processor  continues 

16  ctif re»l-ra»fc  :«  currentjrank  +  rank,,t, 

16  T[Uft.child].val  ;«=  rank, a  — propagate  rank  update 

17  T\siblitig(po$)].right~child  ~  leftjchild  — — modify  trees 

18  Uft.child  :=  7’{«iWing(pos)] 

19  pos  :as  ^s/2j  — — move  to  parent  node 

20  ft 

21  else - processor  has  stopped;  just  propagate  rank  updates 

22  if  T\pos}.val  >  0  then  — there's  a  rank  update  to  propagate 

23  current.rank  ■=  currt.nt.rank  +  T\pos\.val  • — update  own  rank 

24  7|7'[po<].right-cAiM].t>e!  :=  T(pos].val  —— pass  the  update  on  to  the  right  child,  if  there  t*  one 

25  T{pox]val :»  0  — get  ready  for  the  next  iteration 

26  fi  ft 

27  od 

28  - complete  propagating  rank  updates  and  broadcast  the  number  of  live  processors 

29  for  i  :=*  l..iog  P  do 

30  if  stopped  —  false  then  ——this  is  the  processor  that  reached  the  root 

31  r[!f  /f.cht/rf].»a/  :=  —current.rank 

32  else  — stopped  processors  just  propagate  the  values  they  receive 

33  if  Tjposj.ral  >  0  then  — there's  a  rank  update  to  propagate 

34  atrrenl.runk  current.rank  +  Tfrwjtj.val  — update  own  raid 

35  Tripos]. ri0Ai-cbildj.ua/  :=  T^posj.val  — pass  the  update  on  to  the  right  child,  if  there  is  one 

36  T[pos].val 0  — gel  ready  for  the  next  iteration 

37  else  if  T[p»»].val  <  0  then  — this  is  the  number  of  alive  processors 

38  T{left.child\.val  :ss  r[T^>or].r»ffbt.eb«'ld],val  :=  T\pos].val  — propagate  to  both  children 

39  alive  —T\pos].val - aline  is  the  number  of  alive  processors 

40  flfi 

41  od 


Fig.  10.  A  detailed  description  of  phase  El. 


the  left,  its  left  subtree  the  tree  of  the  left  processor  and  its  right  subtree  the  tree  of  the 
right  processor  (lines  17-18).  An  illustration  of  this  process  is  provided  in  Fig.  11  when 
P  =  8. 

When  the  rightmost  live  processor  gets  to  the  root  its  rank  is  the  number  of  live  proces¬ 
sors  and  this  value  is  propagated  through  the  tree  down  to  the  other  processors  (lines  29-41). 
In  the  algorithm  of  Fig.  10  we  broadcast  this  value  as  a  negative  number  to  distinguish  it 
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(a)  (b) 


(c)  (d) 


Fig.  1 1 .  An  illustration  of  phase  El  with  8  processors  of  which  4  have  failed.  Circles  indicate 
processors  that  have  stopped  their  ascend  because  of  encounters  with  other  live  processors 
and  squares  indicate  processors  that  continue  their  upward  movement.  The  numbers-  beside 
the  processors  indicate  their  ranks  at  each  stage  of  the  algorithm. 

from  the  Tank  updates  whid  are  posStSve.  The  whole  process  takes  2  log  P  steps. 

The  algorithm  can  be  made  to  work  in  place  by  collapsing  the  levels  of  the  tree  into  a 
single  linear  array.  In  addition,  because  of  tbe  probing  scheme,  the  initial  values  of  the  tree 
are  irrelevant  and  so  the  algorithm  works  even  in  the  presence  of  initial  memory  faults. 

In  phase  E2,  knowing  the  number  of  live  processors  and  their  ranks,  processors  partition 
the  input  and  each  sets  to  1  the  entries  in  its  own  part. 

Theorem  5.1.  The  Write-All  problem  with  initial  processor  and  memory  faults  can  be 
solved  in  place  with  S  =  0(N  -f  P*  log  P)  on  an  EREW  PRAM,  where  1  <  P  <  N  and 
P  —  P1  is  the  number  of  initial  faults. 

With  the  result  of  [5]  it  can  be  shown  that  this  algorithm  is  optimal. 

Algorithm  E  can  be  used  for  simulations  of  PRAM  algorithms  on  fail-stop  PRAMs  that 
are  subject  to  static  initial  processor  and  memory  faults.  Simulations  are  much  simpler  for 
this  case  as  compared  to  the  dynamic  failures  case. 

In  particular,  only  the  enumeration  phase  El  of  algorithm  E  is  needed.  At  the  beginning 
of  the  simulation  we  use  the  enumeration  algorithm  to  determine  the  number  P'  of  live 
processors.  We  then  use  Brent’s  Lemma  [6]  to  simulate  the  P  processors  of  the  original 
algorithm  on  the  P'  available  processors.  The  time  overhead  of  the  simulation  in  this  case 
is  just  O(log  P)  for  the  enumeration  and  the  woTk  overhead  is  O(P'logP).  Both  overheads 
are  additive.  Thus  we  have  the  following: 
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Theorem  5.2.  Any  P-processor,  r  parallel  time  PRAM  algorithm  can  be  simulated  on  a 
fail-stop  PRAM  that  is  subject  to  static  initial  processor  and  memory  faults.  The  work  of  the 
simulation  is  P  •  r  +  0(P'  log  P),  where  P*  is  the  number  of  live  processors.  The  simulating 
PRAM  is  of  the  same  type  as  the  PRAM  for  which  the  simulated  algorithm  is  written  and 
the  read  and  write  concurrencies  of  the  simulation  are  no  more  than  those  of  the  simulated 
algorithm. 

Note  that  when  P  •  r  =  n(JP/log  P),  the  simulation  is  optimal, 

6.  Conclusions  and  Open  Problems 

We  have  shown  that  it  is  possible  to  obtain  robust  solutions  to  the  Write- All  problem 
that  make  concurrent  accesses  only  subsequent  to  processor  failures  and  then  in  a  controlled 
fashion.  We  have  further  demonstrated  that  we  cannot  reduce  the  write  concurrency  much 
further. 

An  open  question  is  whether  we  can  improve  the  read  concurrency  by  eliminating  the 
logarithmic  factor  from  p. 

A  second  open  problem  is  whether  there  is  a  faster  algorithm  than  W  for  the  Write- All 
and  whether  the  methods  presented  here  for  controlling  memory  accesses  can  be  applied  to 
it. 

In  [22]  an  Q(N  log  N)  lower  bound  was  shown  for  the  Write- AD  problem  but  no  known 
deterministic  algorithm  attains  it.  A  third  open  problem  is  whether  the  lower  bound  of  [22] 
applies  to  the  static  case.  It  is  interesting  to  consider  whether  there  is  a  CRCW  algorithm 
for  the  static  Write- AD  that  requires  o( N  log  N )  work. 
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